Foreword


The Thirteenth Image Understanding (IU) Workshop, sponsored by the Defense Advanced Research
Projects Agency (DARPA), Information Processing Techniques Office, was held at Stanford University,
Palo Alto, California, on 15-16 September 1982. The Intelligent Systems Program Manager, Cdr. Ronald B.
Ohlander, USN, welcomed the more than one hundred government, academic and industrial participants to the
workshop. He noted that it was particularly fortuitous that this meeting, occurring some seventeen
months after the last DARPA-sponsored workshop in Washington, D.C. in April 1981, should take place in
the Palo Alto, California area. One of the principal participants
since the beginning of the program, Stanford University, was the host, and two of the premier Artificial
Intelligence Laboratories in the country, Stanford University and SRI International, are located in the
area and were available for visits by workshop participants. In addition, the cartographic testbed which
has been established at SRI International has progressed at a very acceptable rate, and participants
received a more detailed description, as well as demonstrations, by Dr. Andy Hanson of SRI on the
afternoon of the second day of the workshop. Cdr. Ohlander also welcomed the participation of the Defense
Mapping Agency (DMA) at the workshop and noted that their presentation at the afternoon session of the
first day presented a view of DoD technology requirements that should have significant impact on future
research directions.

Section I of these proceedings contains outlines of the program progress reports as presented
in Session I by the seven principal investigators involved in the DARPA research program in Image
Understanding. Session II of the workshop consisted of the DMA discussion entitled "DMA's Long Range Plan
for Technology Transfer and Its Relationship to the DARPA Community." This presentation, by Mr. Henry Cook
and Major Dave Nelson of DMA, was followed by a general discussion. No report of this session is contained
in these proceedings. Sessions III and IV consisted of eighteen technical papers presented during the
afternoon of day one and the morning session of day two. The papers from these two sessions are
reproduced in Section II of these proceedings. Session V of the workshop, on the afternoon of day two, was
devoted to the visits to the Stanford University and SRI International Artificial Intelligence
Laboratories as described above. Since time did not permit the presentation of all technical papers
prepared by the researchers in the DARPA Image Understanding Program, those papers not presented are
reproduced in these proceedings in Section III. This procedure was adopted in order that a complete
record be available to research and operational personnel requiring information on the subjects covered.

This workshop was hosted by Dr. Thomas O. Binford, Research Associate at the Artificial
Intelligence Laboratory at Stanford University. The workshop organizer wishes to express the appreciation
of all to Dr. Binford for his invaluable assistance in providing the necessary arrangements and
facilities. Gratitude is also due to Ms. Marianne Siroker of the Stanford University staff, who attended
to the myriad details necessary to arrange for and conduct the workshop. The success of the
workshop is largely due to Ms. Siroker's efforts. Support, including mailings, collection and
arrangement of the conference proceedings, and assistance for the workshop in general, was provided by
Miss Karen Dong, Ms. Charlotte Dettor, and Ms. Lau- Lilas of the Science Applications, Inc. staff.

The materials for the cover of this document were submitted by Dr. Harlyn Baker of Stanford
University. Dr. Baker notes that the urban stereo imagery is a synthetic pair generated by Control Data
Corporation (for demonstrating their graphics capabilities some years ago), and shows a block-type
depiction of roads and buildings of various heights. The depth map, states Baker, was produced by an
edge and intensity based stereo matching process described in his paper on this subject in these
proceedings. Dr. Baker believes that, in the future, depth maps of the sort depicted here will be used
within the ACRONYM modelling and reasoning system for recognition of 3-dimensional structures. Both
pairs of figures (the images and the depth map) are set up for stereoscopic viewing with cross-eyed
fusion (left eye sees right image of pair, right eye sees left image). The artwork and layout for the
proceedings cover were created by Mr. Tom Dickerson of SAI.

Lee  S.  Baumann 

Science  Applications,  Inc. 

Workshop Organizer

ABSTRACT

Current activities on the project are summarized
under the following headings:

(a) Preprocessing and segmentation,
(b) Feature detection and texture analysis,
(c) Hierarchical representations,
(d) Matching and motion.

1. Introduction

This project is concerned with the study of
advanced techniques for the analysis of reconnaissance
imagery. It is being conducted under
Contract DAAG-53-76-C-0138 (DARPA Order 3206),
monitored by the U.S. Army Night Vision and
Electro-Optics Laboratory (Dr. George Jones).
The Westinghouse Systems Development Division,
under a subcontract, is collaborating on implementation
and application aspects.

Work on the current phase of the project was
initiated in April 1980. Accomplishments and
publications during the period 1 April 1980 -
31 July 1981 are summarized in two earlier status
reports [1-2], the first of which also appeared
in the Proceedings of the April 1981 Image Understanding
Workshop [3]. The present report,
covering the period 1 August 1981 - 31 July 1982,
is being issued separately and will also appear in
the Proceedings of the September 1982 Image Understanding
Workshop. For convenience, publications
since February 1981 are also cited here, since
they were not cited in the April 1981 Workshop
Proceedings.

The project is concerned with three principal
areas: segmentation techniques; context-based
target detection in FLIR imagery; and analysis of
time-varying imagery. Work in the first area is
summarized in Sections 2 (Preprocessing and segmentation)
and 3 (Feature detection and texture
analysis), while Section 4 summarizes work on the
use of hierarchical image representations
("pyramids") in both segmentation and feature
detection. Three papers in these areas, dealing
with a comparative study of segmentation techniques
as applied to FLIR imagery [4], and with the use of
pyramids for extracting compact objects from an
image [5,6], also appear in the Workshop Proceedings.
Work on context-based target detection is
covered in a report that also appears in the Workshop
Proceedings [7]; a second report on this topic
is in preparation. Finally, Section 5 summarizes
work on image matching and time-varying imagery
analysis; one paper in this area also appears in
the Workshop Proceedings [8].

2. Preprocessing and segmentation

2.1  Comparative  segmentation  study 

A comparative study of FLIR image segmentation
techniques was conducted, using a database of 51
images obtained from four different sources. The
techniques compared included two- and three-class
relaxation, "pyramid linking", and "superspike"
(see below). The results are described in detail
in [4] and in a paper appearing in the Workshop
Proceedings.

2.2  Constraint-based  region  identification 

A context-based approach to region identification
on FLIR imagery was developed; it uses constraint
filtering techniques to identify regions
as (possibly) belonging to the classes sky,
smoke, ground, tank, and tree. A detailed description
of the approach and examples of its use
can be found in [7], which also appears in the
Workshop Proceedings.

2.3  Histogram-based  image  smoothing 

A powerful method of edge-preserving image
smoothing known as "superspike" has been developed.
It is based on repeatedly averaging each pixel
with a subset of its neighbors, where the neighbors
used are chosen on the basis of their relationships
with the given pixel on the image's
histogram. Specifically, we use a neighbor if its
value is more probable than the pixel's, and there
is no concavity on the histogram between its value
and the pixel's; these conditions imply that it
belongs to the same histogram peak as the pixel,
and is higher up on that peak. This method can
also be applied to multi-spectral imagery, using
the scattergram rather than the histogram [9].
Figure 1 shows an example of this type of smoothing
applied to a color image of a house, using
only two bands (red and blue). The result is
quite cartoon-like, and the scattergram of the
smoothed image is virtually reduced to a small set
of spikes.
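
The neighbor-selection rule above can be illustrated with a small sketch. It is not the implementation from [9]: it assumes a single-band image stored as a list of rows of integer gray levels, an 8-neighborhood, and it approximates "no concavity between the two values" by requiring that the histogram never decreases when walking from the pixel's value to the neighbor's value. The function names are hypothetical.

```python
from typing import List

Image = List[List[int]]


def histogram(img: Image, levels: int) -> List[int]:
    """Count how many pixels take each gray level."""
    h = [0] * levels
    for row in img:
        for v in row:
            h[v] += 1
    return h


def same_peak_and_higher(h: List[int], p: int, q: int) -> bool:
    """Assumed test: q is more probable than p, and the histogram never
    decreases on the way from p to q (a stand-in for 'no concavity')."""
    if h[q] <= h[p]:
        return False
    step = 1 if q > p else -1
    return all(h[k + step] >= h[k] for k in range(p, q, step))


def superspike_pass(img: Image, levels: int = 256) -> Image:
    """One smoothing pass: average each pixel with its accepted neighbors."""
    h = histogram(img, levels)
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(rows):
        for c in range(cols):
            p = img[r][c]
            chosen = [p]  # the pixel itself always takes part in the average
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        q = img[rr][cc]
                        if same_peak_and_higher(h, p, q):
                            chosen.append(q)
            out[r][c] = round(sum(chosen) / len(chosen))
    return out


def superspike(img: Image, levels: int = 256, passes: int = 5) -> Image:
    """Repeat the pass; the histogram is recomputed before every pass."""
    for _ in range(passes):
        img = superspike_pass(img, levels)
    return img
```

In this sketch an isolated pixel on the flank of a peak is pulled toward the peak, while two equally populated, well-separated gray levels leave each other untouched, which is the edge-preserving behavior described in the text.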

2.4  Segmentation  by  bimean  clustering 

The mean is the best-fitting constant, in the
least squares sense, to a given set of data. We
define the "bimean" of the data as the best-fitting
pair of constants. If the data are image gray
levels, the bimean defines a segmentation of the
levels into two populations, each consisting of
those levels that are closer to one of the constants
than to the other. An algorithm for finding
the bimean of a set of scalar data has been
developed. It yields good segmentations in some
cases which are not well segmented by the two-class
ISODATA clustering algorithm. The details,
and examples, can be found in [10].
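
As an illustration of the definition (not the algorithm of [10], whose details are not given here), the bimean of scalar data can be found by exhaustive search: for one-dimensional data the best two-group least-squares partition is always a split of the sorted values, so it suffices to try every split point. The function names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def bimean(data: Sequence[float]) -> Tuple[float, float, float]:
    """Return (low_constant, high_constant, squared_error) of the best
    two-constant least-squares fit, trying every split of the sorted data."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    # Prefix sums of x and x^2 give the error of any contiguous group in O(1).
    s1 = [0.0]
    s2 = [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i: int, j: int) -> float:
        """Squared error of xs[i:j] about its own mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = None
    for k in range(1, n):  # xs[:k] is the low group, xs[k:] the high group
        err = sse(0, k) + sse(k, n)
        if best is None or err < best[2]:
            best = (s1[k] / k, (s1[n] - s1[k]) / (n - k), err)
    return best


def bimean_segment(levels: Sequence[float]) -> List[int]:
    """Label each value 0 or 1 according to the closer bimean constant."""
    lo, hi, _ = bimean(levels)
    return [0 if abs(v - lo) <= abs(v - hi) else 1 for v in levels]
```

For image data the same search can be run on the gray-level histogram instead of the raw pixels, which keeps the cost independent of image size.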

3.  Feature  detection  and  texture  analysis 

3.1  Edge  and  corner  detection 

Hueckel-type edge detectors are based on finding
a best-fitting step edge to a given image
neighborhood. Some general properties of such
detectors have been derived, and applied to defining
Hueckel-type detectors for various simple
types of neighborhoods. The details are presented
in [11].

If an image contains an object on a contrasting
background, corners on the object's contour
give rise to slope changes in the x- and y-axis
projections of the image. Thus detecting such
changes indicates which rows and columns of the
image are likely to contain corners. The details
of the approach, as well as examples, were presented
in [12] (also summarized in [2]).
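
A minimal sketch of the projection idea follows. It assumes a small image given as a list of rows, takes the projections to be plain row and column sums, and flags a slope change wherever the second difference of a projection reaches a threshold; the detector actually used in [12] may differ, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def projections(img: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (row_sums, column_sums): the projections onto the y- and x-axes."""
    row_sums = [sum(row) for row in img]
    col_sums = [sum(col) for col in zip(*img)]
    return row_sums, col_sums


def slope_changes(profile: Sequence[float], threshold: float) -> List[int]:
    """Indices where the second difference of the profile is large,
    i.e. where the slope of the projection changes."""
    return [
        i
        for i in range(1, len(profile) - 1)
        if abs(profile[i - 1] - 2 * profile[i] + profile[i + 1]) >= threshold
    ]


def corner_candidates(img, threshold: float) -> Tuple[List[int], List[int]]:
    """Rows and columns likely to contain corners of the object."""
    rows, cols = projections(img)
    return slope_changes(rows, threshold), slope_changes(cols, threshold)
```

For a diamond-shaped object the row projection is a triangular profile, so the slope changes fall at the rows of the top, side, and bottom vertices.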

3.2  Texture  analysis 

A  comparative  study  of  texture  classification 
using  various  types  of  features  was  conducted.  The 
best features were (simplified versions of) the
"texture energy measures" developed by Laws at USC.
The  Laws  features  and  texture  samples  used  are 
shown  in  Figures  2  and  3,  and  the  results  are 
summarized  in  Table  1.  The  details  can  be  found 
in  [13]. 
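
For orientation, the four masks named in Table 1 can be built from the standard one-dimensional Laws vectors. The sketch below is a generic version, not the simplified measures of [13]: it assumes each 5x5 mask is the outer product of a vertical and a horizontal vector (first name vertical, second horizontal) and takes the energy to be the mean absolute mask response over all valid windows.

```python
from typing import Dict, List, Sequence

# Standard 1-D Laws vectors of length 5.
L5 = [1, 4, 6, 4, 1]     # level
E5 = [-1, -2, 0, 2, 1]   # edge
S5 = [-1, 0, 2, 0, -1]   # spot
R5 = [1, -4, 6, -4, 1]   # ripple


def outer(v: Sequence[int], h: Sequence[int]) -> List[List[int]]:
    """5x5 mask whose rows are weighted by v and columns by h."""
    return [[a * b for b in h] for a in v]


MASKS: Dict[str, List[List[int]]] = {
    "L5E5": outer(L5, E5),
    "E5S5": outer(E5, S5),
    "L5S5": outer(L5, S5),
    "R5R5": outer(R5, R5),
}


def texture_energy(img: Sequence[Sequence[float]], mask: List[List[int]]) -> float:
    """Mean absolute response of the mask over all fully contained 5x5 windows."""
    rows, cols = len(img), len(img[0])
    total, count = 0.0, 0
    for r in range(rows - 4):
        for c in range(cols - 4):
            resp = sum(
                mask[i][j] * img[r + i][c + j] for i in range(5) for j in range(5)
            )
            total += abs(resp)
            count += 1
    return total / count
```

A classifier in the spirit of Table 1 would compute one such energy per sample and compare it with the class means; that step is omitted here.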

Texture analysis methods can be applied to
terrain classification using arrays of elevation
data, rather than intensity data. Some simple
examples and a brief discussion can be found in
[14]. This approach will become of increasing
interest as high-resolution digital terrain elevation
data becomes available over the coming years.

4.  Hierarchical  methods 

A class of methods for image segmentation and
object detection has been developed that makes use
of a "pyramid" of successively reduced-resolution
versions of the image. One such method constructs
subtrees of the pyramid representing homogeneous
subpopulations of pixels, by creating links between
nearby pairs of pixels on consecutive levels
of the pyramid based on their similarity in value.
This method has been generalized to multispectral
imagery, where better results can be obtained using
two bands than using one band at a time. The details
were given in [15] (also briefly summarized
in [2]).
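
The linking idea can be sketched in one dimension. The code below is a simplified illustration, not the method of [15]: it assumes a signal whose length is a power of two, builds the pyramid by averaging adjacent pairs, lets each node choose between two candidate parents (those whose overlapping four-sample support contains it), and recomputes each parent as the unweighted mean of the children linked to it.

```python
from typing import List, Sequence


def build_pyramid(signal: Sequence[float]) -> List[List[float]]:
    """Each level halves the length by averaging adjacent pairs."""
    levels = [[float(x) for x in signal]]
    while len(levels[-1]) > 2:
        prev = levels[-1]
        levels.append([(prev[2 * i] + prev[2 * i + 1]) / 2 for i in range(len(prev) // 2)])
    return levels


def candidate_parents(i: int, n_parents: int) -> List[int]:
    """Parents whose overlapping 4-child support contains child i."""
    j = i // 2
    other = j - 1 if i % 2 == 0 else j + 1
    return [p for p in (j, other) if 0 <= p < n_parents]


def link_once(levels: List[List[float]]) -> List[List[int]]:
    """Link every node to its most similar candidate parent, then update
    each parent to the mean of its linked children. Returns the links."""
    links = []
    for lvl in range(len(levels) - 1):
        child, parent = levels[lvl], levels[lvl + 1]
        choice = [
            min(candidate_parents(i, len(parent)), key=lambda p: abs(parent[p] - v))
            for i, v in enumerate(child)
        ]
        links.append(choice)
        for p in range(len(parent)):
            kids = [child[i] for i, q in enumerate(choice) if q == p]
            if kids:  # a parent with no children keeps its old value
                parent[p] = sum(kids) / len(kids)
    return links


def pyramid_segment(signal: Sequence[float], iterations: int = 10) -> List[int]:
    """Label each sample by the top-level node its chain of links reaches."""
    levels = build_pyramid(signal)
    links = link_once(levels)
    for _ in range(iterations - 1):
        links = link_once(levels)
    labels = []
    for i in range(len(signal)):
        node = i
        for choice in links:
            node = choice[node]
        labels.append(node)
    return labels
```

Because the parents overlap, a step that does not fall on a block boundary is still recovered: the samples next to the step link to whichever neighboring parent resembles them more.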

Pyramid linking methods can also be used to
extract significant edges from an image, by creating
links between nearby pairs of edge segments on
consecutive levels based on similarity in slope.
The details of this approach were given in [16]
(also briefly summarized in [2]).

A more recent application of pyramid linking
is to the detection and extraction of compact objects
from an image using local "spoke filters" on
each level of the pyramid. This method is described
in detail in [5], which also appears in
the Workshop Proceedings.

Pyramid  linking  is  usually  based  on  forced 
choices,  where  a  pixel  must  link  to  one  of  the 
nearby  pixels  on  the  level  above  it.  A  "softer" 
approach  is  to  use  weighted  links  (the  more 
similar,  the  stronger).  This  too  gives  rise  to 
trees  whose  roots  are  pixels  that  have  only 
negligibly  weighted  links  to  the  level  above  them. 
Typically,  the  leaves  of  such  a  tree  constitute 
a  compact,  homogeneous  piece  of  the  image.  The 
approach  is  described  in  detail  in  [6],  which  also 
appears  in  the  Workshop  Proceedings. 

5. Matching and motion

5.1  Corner-based  image  matching 

Some  experiments  on  relaxation  image  matching, 
based  on  "corner"  features  extracted  from  the 
images,  were  described  in  [17]  (also  briefly 
summarized  in  [2]).  Further  experiments,  in 
which  local  gray  level  correlation  was  used  to 
resolve  ambiguous  cases,  are  described  in  [18]. 

5.2  Corner-based  motion  computation 

By computing (approximately) the spatial and
temporal derivatives of the image gray level at a
given pixel, the component of the velocity of that
pixel in the gradient direction can be estimated.
If the pixel is at a "corner" of an object, where
edges having two different directions meet, its
velocity is thus completely determined. When the
velocities are due to observer motion ("optical
flow"), knowing them at a few points suffices to
determine the translational and rotational components
of the flow [19]. When an object is moving,
estimates of the velocities of its corners can be
"propagated" along its contours to yield a consistent
estimate of object motion [20,21]. Further
details of this approach, together with
examples, are presented in [8], which also appears
in the Workshop Proceedings.
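
The reasoning above can be made concrete with the usual brightness-constancy constraint Ix*u + Iy*v + It = 0, which is assumed here rather than quoted from [8]. A single gradient fixes only the velocity component along the gradient; two edges with different gradient directions at a corner give two such equations, which determine (u, v). The function names are hypothetical.

```python
import math
from typing import Tuple


def normal_speed(ix: float, iy: float, it: float) -> float:
    """Velocity component along the gradient direction: -It / |grad I|."""
    mag = math.hypot(ix, iy)
    if mag == 0:
        raise ValueError("zero gradient: no velocity information at this pixel")
    return -it / mag


def corner_velocity(
    g1: Tuple[float, float], t1: float, g2: Tuple[float, float], t2: float
) -> Tuple[float, float]:
    """Solve Ix*u + Iy*v + It = 0 for the two edges meeting at a corner.

    g1 and g2 are the spatial gradients (Ix, Iy) of the two edges;
    t1 and t2 are the corresponding temporal derivatives It."""
    (a, b), (c, d) = g1, g2
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("edges are parallel; the velocity is underdetermined")
    u = (-t1 * d + t2 * b) / det
    v = (-a * t2 + c * t1) / det
    return u, v
```

For example, a corner moving with velocity (2, -1) whose edges have gradients (1, 0) and (0, 1) produces temporal derivatives -2 and 1, and `corner_velocity((1, 0), -2, (0, 1), 1)` recovers the velocity.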
Figure 1. Multispectral "superspike". a) (Top) Red
and green bands of a color image of a
house. (Bottom) Scatter plot of (red,
green) values, linearly (left) and logarithmically
(right) scaled.
b) Results after application of "superspike";
the parts correspond to those
in (a).

Figure 2. 28 texture samples. Left: grass, raffia, sand, wool. Right: three geological terrain types.
Figure 3. Four 5x5 Laws masks.

Feature:  L5E5  E5S5  L5S5  R5R5  CONX  CONY  E/A  WE/A
Score:      23    25    22    25    20    19   19    19

Table 1. Numbers of samples correctly classified using a single texture feature. CONX and CONY are
Haralick's CON feature for displacements (1,0) and (0,1); (W)E/A is (magnitude-weighted) amount
of edge per unit area.
The goals of Image Understanding Research at CMU have been
to develop basic theory for understanding 3-dimensional shapes
and to demonstrate an integrated system for photo interpretation
(database and interactive/automatic image interpretation
techniques). In this report we will present our recent
representative progress in the following three subprojects: (1)
MAPS; (2) Incremental 3D Mosaic system; and (3) Theory for shape
understanding.


MAPS

MAPS (Map Assisted Photo-interpretation System) is intended to
be an integrated image/map database system for photo
interpretation tasks. We have continued to upgrade MAPS since
we reported in the April 1981 proceedings [McKeown, Kanade 81].
As it stands, MAPS is a large, well-developed system.

One of the new modules added is 3D Map display. Using the
digital terrain database, it can generate and display a 3D view of an
area of interest from a specified view point. Currently the output
images are overlaid by color-coded thematics which are
generated by scan conversion of a polygon map database
provided by the Defense Mapping Agency. In addition, detailed
cultural features, such as buildings, roads and bridges, can be
portrayed in 3D views using the DMA map as a base map.
BROWSE [McKeown, Denlinger 82], a window-oriented display
manager used in MAPS, provides the interface to this capability.

Potential extensions of this module include such capabilities as
displaying terrain profiles across a specified path, displaying a
natural view by patching (rubber-sheeting) an aerial photo image
over the terrain model, and simulating the stream-flow pattern from
a specified point.

McKeown [McKeown 82] has also been working on integrating a
concept map into the system. The concept map database consists
of a collection of concepts each describing spatial features such
as political areas (states, counties), business and residential areas,
parks and natural features (rivers, lakes), and man-made features
(airports, power stations, universities). The entities of these
concepts are hierarchically organized according to both natural
geometrical relationships (such as containment and intersection)
and level-of-detail relationships. In this way, individual map
features can be associated with high-level semantic map
descriptions.


Once  the  concept  map  is  available,  since  the  image-to-map 
correspondence  has  been  established  in  the  system,  MAPS  can 
now  provide  access  to  imagery  and  handle  queries  to  the  database 
through  four  levels  of  access: 


Signal level      Specify a point or area in the displayed image.
                  The resulting image coordinates are mapped to
                  map coordinates by the image-to-map
                  correspondence, which in turn are used to
                  search the concept map database.

Symbolic level    Specify a symbolic name. A user-defined name
                  (e.g., Memorial Bridge) is mapped into the map
                  coordinates by the concept map, and these
                  coordinates can be used to access related
                  images.

Role level        Specify roles (e.g., political) of interesting
                  concepts. Concepts which satisfy the
                  specification are searched in the concept map
                  database.

Geometric level   Specify the latitude/longitude/elevation.
                  Certain geometric properties such as
                  containment, intersection, and closest-point
                  are computed at this level.


MAPS can now handle queries across all the levels of access in
specifying and answering. A few typical examples are:

• "Display images of National Airport before 1978" (Get image
from symbolic level)

• "What is the closest political building to this [pointing to the
screen] geographic point?" (Get symbolic level from role and
signal levels with geometric constraint)

• "How many bridges cross the Potomac River between Virginia
and the District of Columbia?" (Get symbolic level from
symbolic level with geometric constraint)

The image/map database is important for photo interpretation
because interpretation of a given image is meaningful only in the
context of the target area. That context should be quickly provided
to the interpreter, and the interpretation result should then be
stored in a manner that can be used in the future. Thus an
integrated image/map database system, like MAPS, is a vital
component in computer photo interpretation systems.

Incremental 3D MOSAIC System

As a step toward automatic interpretation of aerial photos of
urban scenes, we have been developing the 3D Mosaic system
[Herman, Kanade, Kuroe 82]. The goal of this system is to
incrementally acquire a 3D model of a complex urban scene from
images. The notion of incremental acquisition arises from the
observations that (1) single images contain only partial information
about a scene, (2) complex images are difficult to fully interpret,
and (3) different features of a given scene tend to be easier to
extract in different images because of differences in viewpoint and
lighting conditions. Our method involves using multiple views of
the scene in a sequential manner. As each successive view is
analyzed, the model of the scene is incrementally updated with the
information derived from the new view. The model is initially an
approximation of the scene, and becomes more and more refined
as new images of the scene are acquired and processed.

Currently we are working on a scene in Washington, DC (Figure
1 shows one of the stereo-pair images). The objective is to obtain a
3D scene description of buildings in the area. As shown in Figure
2, the system consists of: a module for stereo analysis, a module
for building and modifying the 3D scene model, and a module for
generating images from new viewpoints for display purposes.
First, stereo analysis based on matching junctions and lines
generates a 3D wire-frame description. A surface-based scene
model (represented as a structure graph) is then derived from this
description. Currently, the model includes only planar surfaces
and buildings are approximated by polyhedra. Figure 3 shows a
display of the 3D model obtained.

Once the 3D model is built, it can be used for various purposes
including: (1) matching the model against new images of the scene
for such tasks as model revision and change detection; (2)
synthesizing images as seen from hypothetical view points for
familiarizing personnel with the area; and (3) using the model for
planning paths for flight plans or robot navigation tasks. In this
way, the 3D Mosaic scene model plays the role of a central
description which (1) reflects the current understanding of the
scene, (2) assimilates new information about the scene, and (3)
permits decisions dealing with the scene environment to be made.

Figure 1: A sample target area

Figure 2: 3D Mosaic flowchart: boxes are major modules and
ellipses are data structures

Figure 3: A perspective view of the buildings generated from
the 3D scene model that has been constructed by the
3D Mosaic system

Theory for Shape Understanding

At CMU, we have been working on fundamental theories, such
as the theory of shape from texture [Kender 80] and the theory of
mapping image properties into shape constraints [Kanade
81, Kanade, Kender 80], which provide mechanisms to understand
3-dimensional scenes from images. Our emphasis continues to be
in the geometrical aspects of image constraints for extracting
shape, and we have added new results in this area.

Shadow Geometry

We have been studying the geometric interpretation of shadows
in images. Given a line drawing with shadow regions identified and
correspondences established with shadow making regions, Shafer
and Kanade [Shafer, Kanade 82] have developed a theory which
describes the resulting geometrical constraints upon the
orientations of the surfaces involved.

A Basic Shadow Problem is first posed in which there is a single
light source, and a single surface casts a shadow on another
(background) surface. There are six parameters to determine: the
orientation (2 parameters) for each surface, and the direction of
the vector (2 parameters) pointing at the light source. It was found
that if some set of 3 of these are given in advance, the remaining 3
can then be determined geometrically from the image. (Note that
this includes the commonly occurring cases of inferring shape with
known sun angle and ground plane.) The solution method consists
of identifying "illumination surfaces" consisting of illumination
vectors, assigning Huffman-Clowes line labels to their edges, and
applying the corresponding constraints in gradient space. A
closed-form solution has been found for this problem.

When multiple light sources and multiple surfaces are involved,
there is no essential difference. Each light source for which the
three-space direction of illumination is known provides one degree
of constraint; light sources whose 3D direction is unknown provide
none. However, the shadow geometry shows that shadows
provide important benefits for image understanding: shadows
allow one to substitute information about the light source position
instead of a priori knowledge of object surface orientation;
shadows allow one to use highly visible shadow edge pairs in place
of frequently unreliable edges within shaded portions of an image;
an increasing amount of information is provided when the shadow
falls on many visible, differently oriented surfaces. These are often
experienced in photo interpretation.

Some methods have been found for combining shadow
geometry with other shape inference methods, such as Horn's
shape from shading [Horn 77] and Kanade and Kender's skewed
symmetry [Kanade, Kender 80]. Work in progress includes
extending the results to perspective images and further exploration
of curved surfaces.

Shape  Representation 

We are exploring the definition, notation, and properties of two
types of shape representation: gradient space and generalized
cylinders. Shafer, Kanade, and Kender (now at Columbia) [Shafer,
Kanade, Kender 82] have compiled a comprehensive presentation
of the gradient space, beginning with basic definitions. Vector
gradients are introduced as a useful addition to surface gradients:
the gradient of a vector is the same as the gradient of the surfaces
normal to that vector. Vector gradients allow elegant statements
about perpendicularity and polyhedral edges in the gradient space.
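
As a small illustration of a vector gradient, the sketch below assumes the convention that a plane z = p*x + q*y + c has gradient (p, q), so its normal is proportional to (p, q, -1); the report itself does not state its sign convention, and the function names are hypothetical. Under this convention two vectors with gradients G1 and G2 are perpendicular exactly when G1 . G2 + 1 = 0.

```python
from typing import Tuple

Vector = Tuple[float, float, float]
Gradient = Tuple[float, float]


def vector_gradient(v: Vector) -> Gradient:
    """Gradient (p, q) of the planes z = p*x + q*y + c that are normal to v."""
    a, b, c = v
    if c == 0:
        raise ValueError("a vector parallel to the image plane has no finite gradient")
    return (-a / c, -b / c)


def perpendicular(g1: Gradient, g2: Gradient, tol: float = 1e-9) -> bool:
    """Two vectors are perpendicular iff their gradients satisfy G1 . G2 + 1 = 0."""
    return abs(g1[0] * g2[0] + g1[1] * g2[1] + 1.0) <= tol
```

The perpendicularity test does not depend on the sign convention, since flipping the sign of both gradients leaves their dot product unchanged.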

The  relationship  between  perspective  and  the  gradient  space 
has  been  fully  explored,  beginning  with  definitions  of  vanishing 
points  and  lines,  which  are  very  closely  related  to  vector  and 
surface  gradients.  The  vanishing  gradient  of  an  image  line  has 
been  defined  as  the  gradient  of  the  surfaces  for  which  that  line  is 
the  vanishing  line.  Using  vanishing  gradients,  the  connect-edge 
relationship  for  perspective  has  been  defined  with  more  precision 
than  in  past  work. 

We  have  also  been  studying  the  fundamental  properties  of 
generalized  cylinders.  Beginning  with  a  formal  mathematical 
definition  of  generalized  cylinders,  we  have  identified  several 
subclasses  of  particular  interest,  including  those  with  linear  (line 
segment)  axes,  those  with  linear  scaling  functions,  those  with 
cross-sections  perpendicular  to  the  axis,  and  those  with  circular 
cross-sections.  Formulae  for  the  coordinates  of  points  on  the 
surface  of  a  generalized  cylinder  and  for  the  surface  normals  have 
been  derived. 

Several  interesting  theorems  have  been  proven  regarding  the 
existence  of  multiple  representations  for  the  same  solid  shape,  and 
the  directions  of  surface  normals  at  various  points  on  the  shape. 
Current  work  is  centered  around  the  twin  problems  of  predicting 
the  projected  shape  of  a  generalized  cylinder  from  various 
viewpoints,  and  analyzing  an  image  (silhouette  or  range  data)  to 
deduce as much as possible about the solid shape.

Occluding  Boundaries  in  Computing  Optical 
Flow 

We [Cornelius, Kanade 82] have developed an algorithm that
assigns velocities to image points by explicitly taking object
boundaries into account. The iterative method by Horn and
Schunck [Horn, Schunck 81] for computing optical flow (the
distribution of apparent velocities of movement of brightness
patterns) from an image sequence assumes that the velocities vary
smoothly over the entire image. This assumption, however, has
limited utility in real images where object boundaries are usually
places of velocity discontinuity. The method of Cornelius and
Kanade assumes that the velocity and resultant image changes
vary smoothly over bounded regions corresponding to objects. It
also allows changes in the brightness patterns (due to change in
pattern of the object or lighting conditions) so that velocities more
closely represent the motions of objects projected onto the image
plane.

Discontinuities  in  velocity  which  occur  at  object  boundaries 
must  be  explicitly  accounted  for  in  order  to  accurately  determine 
velocities  within  the  boundaries.  To  allow  for  these  discontinuities, 
the  smoothness  constraint  is  applied  separately  to  regions  on 
either  side  of  a  boundary.  This  can  be  done  once  the  projections 
of  the  object  boundaries  have  been  located  in  the  image. 
Interestingly,  implementation  does  not  require  that  the  image  be 
segmented  into  regions  corresponding  to  objects,  rather  only  that 
the  location  of  possible  object  boundaries  be  determined. 
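
The boundary-limited smoothness constraint can be illustrated with a small sketch. It is not the authors' implementation: the function name, the grid-of-lists representation, and the choice to mark boundaries as a per-pixel mask are all assumptions made here for illustration. It shows only the neighbor-averaging step of a Horn and Schunck style iteration, modified so that a pixel is never averaged with neighbors that lie on a marked boundary.

```python
def average_within_boundaries(field, boundary):
    """One smoothing pass over a 2-D velocity component.

    field    -- list of rows of floats (one velocity component per pixel)
    boundary -- list of rows of bools, True where a possible object
                boundary has been located (hypothetical representation)

    Each non-boundary pixel is replaced by the mean of its 4-neighbors,
    skipping neighbors outside the image or on a boundary pixel, so
    values do not leak from one side of a boundary to the other.
    """
    rows, cols = len(field), len(field[0])
    out = [row[:] for row in field]
    for r in range(rows):
        for c in range(cols):
            if boundary[r][c]:
                continue  # boundary pixels are left unchanged
            neighbors = [
                field[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols and not boundary[rr][cc]
            ]
            if neighbors:
                out[r][c] = sum(neighbors) / len(neighbors)
    return out
```

Note that the sketch needs only the boundary mask, not a segmentation into labeled regions, which mirrors the observation above that only the locations of possible boundaries are required.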

A  change  in  the  brightness  pattern  refers  to  the  change  in  image 
brightness  of  the  same  physical  point  on  an  object  from  one  frame 
to  the  next.  This  might  occur  when  an  object  rotates  and  the 
lighting  hits  the  object  in  a  different  way.  For  x-ray  images  of  a 
beating  heart,  the  brightness  at  a  point  in  the  image  is  dependent 
on  the  depth  of  the  heart  cavity  perpendicular  to  the  image  plane. 
Thus  the  pattern  changes  will  reflect  the  expansion  or  contraction 
movement  of  the  heart  in  the  direction  perpendicular  to  the  image 
plane. A pattern change will also occur when a point on the object
is  obscured  or  revealed  in  successive  image  frames.  This  second 
type  of  change  causes  discontinuities  in  the  velocity  across 
boundaries,  which  have  already  been  discussed.  To  allow  for 
pattern  changes  in  the  image,  the  rate  of  brightness  change  of  a 
given  physical  point  is  treated  as  another  velocity  component 
under  the  constraint  that  it  changes  smoothly  within  boundaries. 
While  this  rate  of  brightness  change  is  not  strictly  a  velocity,  it 
makes  sense  to  constrain  it  to  vary  smoothly  within  object 
boundaries,  just  as  is  done  for  the  velocity  components. 

The  methods  developed  are  applied  to  models  of  ellipsoids  and 
boxes undergoing expansion and rotation, and to x-ray image
sequences of a beating heart.

Range Data Acquisition and Analysis

We have developed a laser-scanning ranging device [Kanade,
Asada 81], [Kanade, Fuhrman 82]. This device provides range
measurement of approximately 0.3 mm resolution in the cubic
space of 20 x 20 x 20 cm located 60 cm from the sensor; the speed
of  measurement  is  1000  points/sec.  It  will  also  simultaneously 
measure  reflectance  of  surface  points. 

We  are  working  on  techniques  for  obtaining  object-centered 
descriptions  of  objects  from  range  data.  This  bottom-up 
procedure  is  essential  for  tasks  such  as  learning  shape  by 
presentation. In working on this problem, Smith [Smith 82] has
characterized  and  classified  contours  which  appear  in  range 
images  obtained  by  triangulation.  His  classification  provides 
insight  into  the  problem  common  to  shadows,  stereo,  and 
triangulation-based  range  finders. 

When  deducing  object  geometry  from  the  contours,  it  is 
important  to  understand  how  reliable  they  are.  However,  this  topic 
has  not  previously  been  addressed.  Indeed,  some  types  of 
contours  which  are  possible  with  a  triangulation  range  finder  have 
escaped  notice. 

The  most  reliable  contours  are  sharp  occluding  contours. 
Receding  occluding  contours  have  more  uncertainty  in  position 
along  the  line  of  sight  of  either  the  illuminator  or  camera, 
whichever  has  the  oblique  view.  An  occluding  contour  provides 
information  about  the  object  boundary.  An  occluded  boundary,  on 
the  other  hand,  is  a  projection  of  an  occluding  contour  onto  the 
occluded  object  surface  along  the  line  of  sight  of  either  illuminator 
or camera: the relationship between occluding and occluded
contours is equivalent to that of shadow-making and cast-shadow
edges. A problem is to determine which contours are
occluding  and  which  are  occluded. 

Figure  4  depicts  a  triangulation  range  finder  imaging  a  sphere 
and  an  object  behind  it.  A  point  on  the  surface  of  the  sphere  can 
be ranged only when it is visible from both the illuminator and the
camera. The deep shadow, which is visible from neither vantage
point, is called the umbra. The space which is shadowed from one
vantage point, but not the other, is called the penumbra. A contour
separating the rangeable face of the sphere from a portion in the
penumbra is an occluding penumbral contour; a contour
separating a section lying in the penumbra from a section lying in
the umbra is an occluding umbral contour. These classes are
further split into subclasses which denote whether the contour is
caused by the limit of view of the illuminator or the camera.

Each contour in Figure 4 is labeled with a code. C and I denote
whether it was caused by the vantage point of the camera or the
illuminator; p and u indicate penumbral and umbral contours; 1
and 2 indicate occluding and occluded contours. Thus, a Cp1
contour arises where an object curves away from the camera, and
this contour can cause a Cp2 contour on an object behind.
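
The three-character labeling scheme is compact enough to express as a lookup. The following is a minimal sketch for illustration only; the function and table names are hypothetical, and it handles just the fully specified codes described above.

```python
# Meaning of each position in a contour code such as "Cp1".
VANTAGE = {"C": "camera", "I": "illuminator"}
SHADOW = {"p": "penumbral", "u": "umbral"}
ROLE = {"1": "occluding", "2": "occluded"}


def decode_contour_code(code):
    """Split a three-character contour code into its three attributes."""
    if len(code) != 3:
        raise ValueError("expected a three-character code such as 'Cp1'")
    vantage, shadow, role = code
    return VANTAGE[vantage], SHADOW[shadow], ROLE[role]
```

For example, `decode_contour_code("Iu2")` yields the illuminator, umbral, occluded combination.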

The  occluding/occluded  contour  pairs  used  by  Sugihara 
[Sugihara 79] correspond to Ip1-Ip2 and Cp1-Cp2 pairs. An
Ip1-Ip2 pair is easily spotted in the deflection image in the natural
(i.e., camera) registration, as it produces a range (deflection) jump.
A Cp1-Cp2 pair can be spotted without conversion to 3D
coordinates  by  reconstructing  the  image  from  the  illuminator's 
point  of  view. 

The task of determining which contours are occluded is
complicated  by  the  fact  that  some  occluded  contours  are 
produced  by  umbral  contours  --  yet  the  umbral  contours  (along 
with  part  of  the  occluding  surface)  are  missing  in  the  image.  There 
is  a  gap  between  the  observed  occluded  edge  and  the  one  which 
would  be  expected  if  only  the  visible  surface  of  the  occluding 
object were casting the shadow. These gaps are the Iu2-Cx2 and
Cu2-Ix2 portions of the occluded object. This suggests that once
we can identify penumbral occluding and umbral occluded
contours for both camera and illuminator (i.e., Cp1, Ip1, Cu2 and
Iu2) we can infer the shape of a cross section more accurately than
simply  by  fitting  a  curve  to  visible  contours  [Agin,  Binford  73]. 

Although  one  cannot  be  certain  whether  a  contour  is  an 
occluding  contour  or  a  u2  contour,  there  are  some  necessary 
conditions: 

•  An  Iu2  contour  occurs  only  when  the  deflection  value 
undergoes  a  drop  between  the  left  and  right  sides  of  a  validity 
gap  in  the  natural  registration. 

•  A  Cu2  contour  occurs  only  when  the  deflection  value 
undergoes a drop between the left and right sides of a validity
gap  in  the  camera  registration. 
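
A necessary condition of this kind can be checked on a single scan line. The sketch below is an illustration under stated assumptions, not Smith's algorithm: invalid (unranged) samples are represented as `None`, a "drop" is taken to mean that the last valid value left of the gap exceeds the first valid value right of it, and the function name is hypothetical. Passing the test makes a gap only a candidate, since the condition is necessary but not sufficient.

```python
def u2_candidate_gaps(scan_line):
    """Return (start, end) index pairs of validity gaps across which the
    deflection value drops from the left side to the right side.

    scan_line -- list of deflection values, with None marking samples
                 that could not be ranged (assumed representation)
    """
    candidates = []
    i, n = 0, len(scan_line)
    while i < n:
        if scan_line[i] is None:
            start = i
            while i < n and scan_line[i] is None:
                i += 1
            # The gap is scan_line[start:i]; it needs a valid sample on
            # both sides before the two values can be compared.
            if start > 0 and i < n and scan_line[start - 1] > scan_line[i]:
                candidates.append((start, i - 1))
        else:
            i += 1
    return candidates
```

The same routine would be run on the scan lines of whichever registration the condition names.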

Smith  has  also  developed  some  additional  heuristics  which  help 
in  identifying  contour  types.  His  algorithms  for  detecting  and 
classifying  contours  have  been  successfully  tested  with  real  range 
data of complex scenes. Work is in progress to obtain shape
representation of objects in the scene for recognition.

This paper summarizes our major research
activities since the last Image Understanding
Workshop in April 1981. More details can be found
in four other detailed technical papers describing
our work in these proceedings. Our research
goals  have  been  to  develop  a  set  of  general 
techniques  that  can  be  applied  to  a  number  of 
application  tasks.  Our  focus  at  the  high  level  has 
been  on  symbolic  matching  of  an  image  to  a  map  or 
another  image.  This  task  is  central  to  many 
applications,  e.g.,  image  based  navigation, 
map-updating,  and  change  detection.  Our  other 
research  has  been  in  support  of  this  high  level  goal 
and  has  included  work  on  segmentation,  texture 
analysis,  object  detection  and  description.  We  have 
also  worked  with  Hughes  Research  Laboratories  in 
defining  and  developing  special  purpose  hardware  for 
IU tasks.

2. Image to Map Correspondence


We  have  developed  two  major  systems  for  this 
task.  The  first  matches  line  segments  found  in  an 
image  with  corresponding  lines  in  the  model  (which 
may be a map or another image). This method is
applicable when geometric distortions are small and
precise positions of lines or edges can be used for
matching. It is fast and tolerant of local errors.
When a line segment in the image corresponds to a
segment in the model (or another image) the locations
of all other segments are restricted to limited
areas with restricted orientations. A small subset
of the lines from one image is used in an initial
relaxation-based scheme which tests pairwise matches
using the above constraint. Given this initial
kernel, all other segments can be quickly matched or
discarded as unmatched using the constraints given
by the kernel. Details of this method and some
results  are  given  in  [1]. 


The  second  matching  system  is  the  result  of 
long-term work in general symbolic analysis and has
depended on development of other image analysis
techniques such as high performance line finders and
general segmentation techniques. This method
matches network descriptions derived from
extracted lines and regions and their relationships.
We have presented the results of this system at
previous IU workshops [5]; our recent work uses
groups of features in a relaxation matching system
and is described in detail in another paper in these
proceedings [2].


3.  Symbolic  Texture  Analysis 

We  have  been  developing  a  system  for  symbolic 
descriptions  of  textures;  the  descriptions  are  in 
terms  of  primitives  and  their  arrangement.  The 
basic  approach  is  to  find  micro-edges  and  search  for 
repetitive  patterns  among  them.  This  process  gives 
us  the  periodicity  of  the  texture,  if  any,  and  the 
dominant  size  of  the  elements,  if  any.  Next,  the 
complete  primitives  are  isolated  and  then  their 
geometrical  arrangements  computed.  The  symbolic 
descriptions  are  sufficient  to  reconstruct  regular 
textures. 
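
The search for repetitive patterns among micro-edges can be sketched in one dimension. This is an illustrative simplification, not the system described here: it assumes the micro-edges along one scan direction have already been reduced to a 0/1 sequence, and it uses a plain unnormalized autocorrelation, which is a stand-in technique chosen for brevity. The function name is hypothetical.

```python
def dominant_period(edges, max_lag):
    """Estimate the repetition period of a 0/1 micro-edge sequence.

    For each lag from 1 to max_lag, count how many edge positions
    coincide with another edge exactly `lag` samples later. The lag
    with the highest count is returned (the smallest such lag on ties),
    or None when no lag produces any coincidence, i.e. no periodicity
    is found.
    """
    best_lag, best_score = None, 0
    for lag in range(1, max_lag + 1):
        score = sum(a * b for a, b in zip(edges, edges[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Returning `None` corresponds to the "if any" qualification in the text: a random texture need not have a period at all.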

We  have  also  applied  these  descriptions  to  the 
recognition  of  natural  textures,  commonly  used  in 
texture  work,  such  as  wool,  water,  raffia,  grass, 
etc.,  and  achieved  a  better  than  90%  accuracy.  The 
errors  are  between  similar,  random  textures  and 
could  be  easily  reduced  if  metric  information  (size) 
were  used.  We  have  also  applied  these  descriptions 
to  estimating  3-D  surface  orientations  using  texture 
gradients. 

Our basic description technique has been
presented at a previous IU workshop [6]; the new
work including surface orientation estimation is
presented in another paper here [3]. Complete
details may be found in a USC Ph.D. thesis [7].

4.  Object  Detection 

We  have  made  a  start  on  detecting  3-D  objects 
based on analysis of the shadows they cast. Figure 1
shows  part  of  an  aerial  image  with  some  buildings 
and  their  shadows.  The  buildings  may  be  hard  to 
distinguish  from  other  similar  shaped  structures, 
e.g.,  a  parking  lot,  without  shadows.  We  are  able 
to  correspond  shadows  and  objects  under  certain 
simplifying  assumptions:  a  known  distant  source  of 
light,  vertical  objects  seen  straight  down  (vertical 
sides not seen), flat and level ground, and the
object  shape  being  generically  known  (a  combination 
of  rectangles  in  this  case). 

Our  basic  approach  is  to  extract  line  segments 
and  corners  first  (see  Figs.  2  and  3  for  example) 
and  to  characterize  the  corners  as  being  likely  to 
belong  to  a  building  or  not.  This  is  based  on  the 
corners having the appropriate photometric
properties (brighter than surround) and also on
being able to find a matching shadow corner. The
evidence from the individual corners is combined
when the corners share a segment. If a closed
boundary is found, it is examined for consistency of
shape. Otherwise, the partial boundary serves as a guide
for searching for potential missing parts. Figure 4
shows the building-like structures extracted in
Fig. 1. The top building was found as a closed
boundary by the program. The lower one required
searching for nearby segments and connecting two
halves because they complement each other according
to the program's predictions. In addition to the
segments,  we  are  also  able  to  determine  the 
(relative)  heights  of  the  buildings.  The  details  of 
this  process  may  be  found  in  our  recent  progress 
report  [8]. 
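
Under the simplifying assumptions listed above (distant light source, vertical objects, flat and level ground), the link between shadow length and height is elementary trigonometry. The sketch below illustrates that relation only and is not the procedure of the progress report; the function names and the use of a sun elevation angle in degrees are assumptions. When the elevation is unknown, only the ratio of heights is recoverable, which is why the heights are relative.

```python
import math


def height_from_shadow(shadow_length, sun_elevation_deg):
    """Height of a vertical object on level ground, given the length of
    its shadow and the elevation of the light source above the horizon."""
    return shadow_length * math.tan(math.radians(sun_elevation_deg))


def relative_height(shadow_a, shadow_b):
    """Ratio of the heights of two objects lit by the same distant
    source; the unknown elevation angle cancels out."""
    return shadow_a / shadow_b
```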


5. Hardware Development


In work with Hughes Research Laboratories, we
have continued the development of the RADIUS
processing system. RADIUS is capable of performing
a variety of programmable operations, such as
convolution and statistical moments. The current
implementation uses a 5x5 kernel, but is modular and
expandable  to  larger  sizes.  The  RADIUS  processor 
has  been  constructed  and  tested.  It  is  to  be 
interfaced  to  a  PDP-11  via  a  UNIBUS,  allowing 
further  processing  on  a  general  purpose  machine. 
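
For readers unfamiliar with the two operations named above, the following software sketch shows what a 5x5 convolution and 5x5 local statistical moments compute. It says nothing about how the RADIUS hardware is organized; the function names are hypothetical, only the mean and variance are shown as moments, and outputs are produced only where the window fits entirely inside the image.

```python
def windows5x5(image):
    """Yield (row, col, values) for every 5x5 window fully inside image."""
    rows, cols = len(image), len(image[0])
    for r in range(rows - 4):
        for c in range(cols - 4):
            yield r, c, [image[r + i][c + j] for i in range(5) for j in range(5)]


def convolve5x5(image, kernel):
    """True convolution: the kernel is flipped in both axes before the
    window values are multiplied and summed."""
    flipped = [kernel[4 - i][4 - j] for i in range(5) for j in range(5)]
    out = [[0] * (len(image[0]) - 4) for _ in range(len(image) - 4)]
    for r, c, values in windows5x5(image):
        out[r][c] = sum(v * k for v, k in zip(values, flipped))
    return out


def moments5x5(image):
    """Local mean and variance over each 5x5 window."""
    out = [[None] * (len(image[0]) - 4) for _ in range(len(image) - 4)]
    for r, c, values in windows5x5(image):
        mean = sum(values) / 25
        variance = sum((v - mean) ** 2 for v in values) / 25
        out[r][c] = (mean, variance)
    return out
```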


The  choice  of  the  operation  for  the  RADIUS 
processor  was  based  on  an  analysis  of  three 
low-level  processing  systems:  Nevatia-Babu  line 
finder [9], Laws' "texture energy" measure [10] and
Ohlander-Price  region  segmentation  [11]. 


We  have  also  begun  a  study  of  the  processing 
required  to  perform  operations  after  convolution, 
e.g.,  thinning,  linking,  shrink  and  expand.  We 
believe  that  such  operations  can  be  performed  by  a 
single,  programmable,  grey-level  "logic  processor." 


Details of the hardware implementation effort
may be found in [4], in these proceedings.

1. Robust Vision Operators

1.1.  Parameter  Networks  and  the  Hough  Transform 

One  of  the  most  difficult  problems  in  vision  is 
segmentation.  Recent  work  has  shown  how  to  calculate 
intrinsic  images  (e.g.,  optical  flow,  surface  orientation, 
occluding  contour,  and  disparity).  These  images  are 
distinctly  easier  to  segment  than  the  original  intensity 
images.  Such  techniques  can  be  greatly  improved  by 
incorporating  Hough  methods.  The  Hough  transform  idea 
has  been  developed  into  a  general  control  technique. 
Intrinsic image points are mapped (many to one) into
'parameter networks' [Ballard, 1981]. This theory explains
segmentation  in  terms  of  highly  parallel  cooperative 
computation  among  intrinsic  images  and  a  set  of 
parameter  spaces  at  different  levels  of  abstraction. 

The most recent applications of these ideas are to
improved shape-from-shading calculations [Brown et al.,
1982] and motion extraction [Ballard & Kimball, 1982].
Both of these domain-specific efforts are closely linked to
our new work on a more general theory of Hough-like
computations and general implementation techniques for
them. Hough-like computations may be modeled as a
form of imaging [Brown, 1982]. The resulting insights
suggested the CHough technique to sharpen peaks and
reduce bias in Hough accumulator space [Brown &
Curtiss, 1982].

The theory is also useful in analysis of cache-based
Hough Transform implementations. It is an appealing idea
to use a small content-addressable store to accumulate
Hough transform results, rather than a potentially huge
multi-dimensional array. Several technical issues are
involved in any such scheme, but the idea appears quite
promising [Brown & Sher, 1982].
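
The idea of accumulating votes in a small associative store can be sketched for straight-line detection. The code below is an illustration only and should not be read as the scheme of [Brown & Sher, 1982]: a Python dictionary stands in for the content-addressable store, the capacity limit and the evict-the-weakest replacement policy are assumptions, and all names and parameter values are hypothetical.

```python
import math


def hough_lines_cached(points, n_theta=4, rho_step=1.0, capacity=8):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) using a bounded
    dictionary keyed by (theta index, quantized rho) instead of a full
    accumulator array."""
    store = {}
    for x, y in points:
        for k in range(n_theta):
            theta = k * math.pi / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (k, round(rho / rho_step))
            if key not in store and len(store) >= capacity:
                # Store is full: discard the cell with the fewest votes
                # (assumed policy; earliest-inserted cell wins ties).
                weakest = min(store, key=store.get)
                del store[weakest]
            store[key] = store.get(key, 0) + 1
    return store


def strongest_cell(store):
    """Return ((theta index, rho index), votes) for the best-supported cell."""
    return max(store.items(), key=lambda item: item[1])
```

With four collinear points on the line y = x, the cell for theta index 3 (135 degrees) and rho 0 collects all four votes while the store never holds more than eight cells. One of the technical issues alluded to above is visible even here: an eviction policy can discard a cell that would later have become a peak.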

1.2  Adaptive  Operators 

Control is a crucial issue in Image Understanding. We
have been investigating the role of low-level adaptive
operators in both the analysis of aerial images and in
problem solving. Early aerial image work was reported in
the last IU proceedings [Selfridge and Sloan, 1981].

In  general,  problem  solvers  cannot  hope  to  create  plans 
that  are  able  to  specify  fully  all  the  details  of  operation 
beforehand  and  must  depend  on  run-time  modification  of 
the plan to ensure correct functioning. Fortunately, many
primitive actions are highly stereotyped and can be
performed  by  adapting  pre-programmed  tactics  to  the 
current  goal  context  and  operating  environment.  The 
recently  completed  thesis  of  Selfridge  [1982]  shows  how 
this  idea  can  be  applied  to  fairly  difficult  problems  in 
aerial  image  understanding.  Related  efforts  are  underway 
in robot construction tasks and intelligent interfaces to
distributed  systems. 

1.3  Medical  Applications 

A  system  has  been  built  in  which  Computer 
Tomograms  of  the  human  abdomen  are  searched  as  a  3-D 
image  and  matched  against  a  detailed  geometrical  model 
of the abdomen anatomy. Detected organ boundaries
serve to construct an instance of the model that reflects the
actual anatomy of a particular patient as revealed by the
corresponding image data. The model-directed approach
makes possible the detection of hard-to-find organs (e.g.,
kidneys) based on known locations of easy-to-find organs
(e.g., spinal column), thus relaxing the problem of
obscured boundaries in noisy data that tend to hinder
data-directed approaches. The model is hierarchical, built
of generalized cylinders, and is inherently parallel. It
captures relational, structural, and quantitative knowledge
that is represented as both data and procedures [Shani,
1981].

Work  was  also  recently  completed  on  a  system  for 
reconstructing  the  shape  of  the  human  heart  from 
ultrasound data [Schudy, 1982]. Our work on the
automated  detection  of  possible  tumor  sites  in  chest 
radiographs  has  reached  the  stage  of  extensive  testing.  All 
of  these  efforts  are  contributing  both  specific  and  general 
techniques  to  our  DARPA  tasks. 

2. Computing with Connections

There is a rapidly growing interest in problem-scale
parallelism, both as a model of animal brains and as a
paradigm for VLSI. Work at Rochester has concentrated on
connectionist models and their application to vision. The
framework is built around computational modules, the
simplest of which are termed p-units. We have developed
their properties and shown how they can be applied to a
variety of problems [Feldman & Ballard, 1982]. More
recently, we have established powerful techniques for
adaptation and change in these networks [Feldman, 1982].

One view of this work is that it extends our
development of robust Hough-like operators to more
complex IU tasks. A major milestone was achieved with
Sabbah's thesis on massively parallel recognition of
Origami-world objects [Sabbah, 1982]. Sabbah's work
extended the connectionist methodology to a problem
domain with several hierarchical structural levels. The
resulting program is, to our knowledge, the most
noise-resistant system for dealing with this level of complexity.
Figure 1 shows the program converging to the correct
reading of a self-occluding object. One outcome of
Sabbah's effort is the start of a project to build a general
purpose simulator for massively parallel systems [Shastri et
al., 1982].

3. Shape

The  description  and  recognition  of  complex  shapes 
continues  to  be  a  major  focus  of  the  project.  The  analysis 
of  the  dot  product  space  representation  has  been  improved 
to  handle  certain  pathological  cases,  and  has  been 
generalized  to  accommodate  different  criteria  for  the 
goodness  of  the  representation. 

This  simple  concept  of  shape  has  been  applied  to  the 
problem of reconstructing three-dimensional surfaces from
very  sparse  data.  The  key  idea  is  to  use  appropriate  shape 
descriptors  to  hypothesize  a  transformation  which  accounts 
for  the  difference  in  shape  between  successive  contours. 
When the hypothesized transformation is minor, very
simple-minded  surface  reconstruction  techniques  are 
sufficient.  When  there  are  major  differences  in  shape  or 
position  between  successive  contours,  our  method 
hallucinates  new  contours,  using  the  hypothesized  shape 
transformation [Sloan and Hrechanyk, 1981].

More recent efforts treat shape within the connectionist
framework described above. Hierarchical descriptions of
shapes were considered in [Ballard & Sabbah, 1981]. A
major current project involves developing techniques for
recognizing articulated objects, whose shape is subject to
change [Hrechanyk & Ballard, 1982].

4. General Theory of Vision

Work  in  our  laboratory,  among  others,  has 
demonstrated  strong  links  between  powerful  IU 
techniques  and  computations  used  by  animal  visual 
systems.  We  have  established  strong  ties  with  a  wide  range 
of  visual  scientists  at  Rochester  and  a  variety  of 
collaborative  efforts  are  underway.  One  early  project  is  to 
survey  the  computational  similarities  in  natural  and 
computer vision [Ballard & Coleman, 1982]. Another
effort is our attempt to develop a general framework for
theories of vision that would provide a common structure
for integrating studies from various disciplines [Feldman,
1982].

Figure 1. Massively parallel network recognition of open
box from edge data. The extra lines depict levels of
abstraction successively developed by the system [cf.
Sabbah, 1982].

Abstract

Progress has been made on extensions to ACRONYM
which include: representation and reasoning with
time, events, and sequences; collaboration with MIT to
develop  geometric  learning;  representation  of  function, 
and  reasoning  between  structure  and  function.  A  new 
ribbon  finder  for  ACRONYM  is  under  construction. 
Work  in  figure/ground  separation  is  underway  as  a 
basis  for  the  ribbon  finder.  Preliminary  results  are 
shown  in  grouping  operations  to  determine  regularities 
in  images.  A  stereo  system  has  been  completed  which 
combines  edge-based  stereo  matching  with  surface 
interpolation  utilizing  correspondence  of  gray  levels. 
Design  of  a  new  stereo  vision  system  is  underway. 

Introduction

The  ACRONYM  system  is  in  use  by  Hughes  in  an  Image 
Understanding  project.  ACRONYM  has  been  transported 
to  the  VAX  by  Brooks,  and  now  runs  on  the  Testbed, 
without  the  ribbon  finder. 

We  see  the  need  in  applications  of  ACRONYM  to 
interpret time-varying phenomena, to identify
coherent sequences of events, and to integrate space/time
information in interpretation. We have added a new
capability to ACRONYM to represent and to reason with
time, events, and sequences [Malik 82].

Efforts  are  underway  to  simplify  the  programming  of 
applications  in  ACRONYM.  We  have  begun  a 
collaboration  with  MIT  to  develop  geometric  learning 
for  vision  to  incorporate  in  ACRONYM.  Initial  results 
are reported in [Binford 82a]. A central issue in
learning is representation. The key theme here is
function = structure, which provides a solution to the
problem of representing function: function is
represented by structure in space/time. [Lowry 82]
discusses  reasoning  between  structure  and  function. 
We  have  also  begun  defining  a  task  specification 
language  for  ACRONYM. 

In previous tests of ACRONYM's interpretation of
aircraft in images, its performance was limited by the
quality of the input data it received: ribbons from a
goal-directed  ribbon  finder  [Brooks  80].  [Marimont  82] 
describes  progress  on  a  new  ribbon  finder  for 
ACRONYM,  based  on  an  improved  edge  finder.  The 
ribbon  finder  segments  edges  into  curved  segments 
which  are  cubic  splines. 


One  paradigm  for  interpretation  in  ACRONYM  has  been 
model-based,  goal-directed  vision.  ACRONYM  selects 
candidates in images and tests them against three-space
models. To select candidates, ACRONYM matches image
predictions with image observations. ACRONYM has
mechanisms to make quite general predictions: it
predicts quasi-invariants in images; it predicts
appearances of features rather than total views (for a
polyhedron of n faces, there are 2^n total views); its
predictions are symbolic expressions with variables.
Although ACRONYM does it about as well as can be
expected, interpretation which is based on prediction
of images has limited generality.

The other paradigm for interpretation in ACRONYM is
constructive inference. This approach is the focus of
our work. This form of interpretation is essential for a
general vision system, although it is much more
difficult than the model-directed approach. Powerful
mechanisms for general interpretation of images as
surfaces in three-space were described in [Binford 81]
and [Lowe 81]. These procedures interpret local
three-dimensional cues given by image curves. [Lowe
82] and [Binford 82b] extend this work in several
ways, centered on the theme of figure/ground
separation. One extension is to consider inference of
grouping processes in the image, without a clear
three-dimensional interpretation. The problems of
finding clusters of dots, edges, and texture groupings
are included. The other extension is to include a new
class of cues to grouping image curves into objects, in
forming ribbons and grouping ribbons.

Continued  research  in  stereo  vision  has  been  sponsored 
in  part  by  this  IU  program,  augmenting  support  by 
RADC.  [Baker  82]  reports  on  a  system  which  combines 
edge-based  stereo  matching  with  surface  interpolation 
utilizing correspondence of gray levels. Members of the
IU group have made an extensive critical survey of
stereo vision systems and the supporting vision
technology [Binford 82c]. Design of a new stereo vision
system is underway.

A central problem in machine vision is implementing
vision algorithms in real time, that is, fast enough for
requirements of applications. [Miller 82b] describe
architecture studies aimed at development of an array
of 128x126 processors in wafer scale integration. They
demonstrate an algorithm for routing the array around
defective elements, an algorithm effective for
defective processor rates less than 10%. They describe
software experiments used to explore issues of
architecture.

Geometric  Reasoning 

The  representation  of  time  assumes  events  and  an 
observer with an internal clock, i.e., a mechanism for
judging before and after, with an internal metric.
Semantically  related  clusters  of  events  have  local  time 
frames  [Malik  82].  Coordinate  transformations  relate 
time  frames.  Events  may  be  continuous  or  discrete.  The 
representation  of  time  makes  use  of  the  same 
mechanisms  as  the  representation  of  space.  Parallelism 
and  synchronization  are  represented  by  constraints. 
Thus,  the  ACRONYM  constraint  manipulation  system 
provides  most  of  the  mechanism  for  representing  and 
manipulating  constraints.  Constraints  are  linear 
symbolic  expressions  with  variables.  Constraint 
resolution  becomes  a  linear  programming  problem 
which  is  solved  by  a  simplex  algorithm.  The  method  of 
testing  satisfiability  of  constraints  is  complete.  Upper 
and  lower  bounds  can  be  determined.  The  uniformity 
of  time  and  space  representation  makes  it  attractive  to 
consider  space  and  time  together,  incorporating  special 
relativity  effects,  for  example. 
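
How upper and lower bounds fall out of a set of linear constraints can be shown on a two-variable toy problem. The sketch below is not ACRONYM's constraint manipulation system and does not use the simplex algorithm: for brevity it enumerates the intersection points of constraint pairs, which works only for tiny, bounded problems. The function name and the (a, b, r) encoding of a constraint a*x + b*y <= r are assumptions made for illustration.

```python
from itertools import combinations


def bounds(objective, constraints, eps=1e-9):
    """Lower and upper bound of cx*x + cy*y over the region defined by
    linear constraints, each given as (a, b, r) meaning a*x + b*y <= r.

    Returns (lower, upper), or None if no feasible vertex exists, which
    for a bounded problem means the constraints are unsatisfiable.
    """
    cx, cy = objective
    values = []
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel constraint lines have no single vertex
        # Cramer's rule for the intersection of the two boundary lines.
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + eps for a, b, r in constraints):
            values.append(cx * x + cy * y)
    if not values:
        return None
    return min(values), max(values)
```

In the time setting, x and y could stand for the times of two events, with constraints encoding orderings and maximum separations; the bounds then give the earliest and latest consistent value of a linear expression in those times.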

Objects  in  ACRONYM  are  defined  by  generic  functional 
classes, not by their geometric form [Binford 82d].
However,  only  generic  geometric  classes  have  been 
implemented.  We  define  object  classes  by  function,  but 
we identify objects visually by their geometric
structure.  [Lowry  82]  describes  reasoning  between 
structure  and  function.  Structure  is  represented  by 
kinematic  and  dynamic  relations  among  geometric 
forms  represented  as  generalized  cylinders.  A  key  issue 
is representation of function. The relation form =
function  provides  a  strong  argument  for  representing 
function  of  most  objects  in  terms  of  abstract  structures, 
i.e.  by  abstracted  forms  of  kinematic  and  dynamic 
relations  among  abstracted  geometric  forms.  The 
mechanism  of  abstraction  is  based  on 
partially-specified generalized cylinders [Binford 82d].
These  mechanisms  require  the  introduction  of  physics 
knowledge  which  presupposes  the  introduction  of 
time. 

Introduction of learning in ACRONYM has a strong
practical focus for applications. Programming of
applications requires the building of geometric models
and the input of background knowledge including
knowledge of experts in image analysis. The
knowledge base required for general vision systems can
be very large. Our success in simplifying input to
vision systems rests on success in non-trivial
structural learning, i.e., learning about objects,
relations, and programs. An essential part of learning is
relating function and form; since fundamental
definitions of object classes are functional, this means
defining the form of object classes as abstract shapes.
This research is intended to provide effective
mechanisms for "letting the machine do the work" in
building large data bases required for interpretation.
There has been much talk and little research in this
area; unrealistic hopes should not be raised about when
these capabilities will be available. Learning of data is
more usual; learning of procedures is central to expert
systems  for  image  understanding.  There  are  two  themes 
to  this  approach  to  learning.  The  first  theme  is 
specialization  to  structure  and  function.  We  believe 
that  a  functional  understanding  of  a  system  is  an 
economical  way  to  understand  a  great  body  of 
structural  information  about  the  system.  The  utility  of 
this  idea  depends  on  whether  it  is  simpler  to  build  a 
data base of functional information, or to input directly
the structural information needed for interpretation. If
functional information is compact in useful domains,
this approach will succeed. We are convinced that it
will succeed. For example, knowing the dimensions of
the human body constrains the dimensions of most
cultural articles used by individuals, like desks, chairs,
tables, passenger compartments of cars, cups, etc.
Physical  considerations  constrain  transportation 
objects,  like  shape  of  aircraft.  Cost  and  construction 
constrain  shape,  size,  and  materials. 

The  second  theme  is  learning  causal  or  criterial 
relations  in  distinction  to  statistical  correlations, 
because  causal  relations  define  decision  criteria  which 
are  dependable.  In  contrast,  decisions  based  on 
statistical  distributions  are  often  unreliable  because 
they  assume  random  samples  and  typically  have 
systematic bias. It is usually convenient to present
biased samples for teaching; in fact, it is quite difficult to
avoid bias in selecting samples of data, and difficult to
obtain large samples of data. For example, in locating
airfields,  the  length  of  an  airfield  is  determined  by  the 
takeoff  and  landing  distance  of  aircraft  which  it 
serves.  Knowledge  of  function  of  the  airfield,  which 
aircraft  it  serves,  allows  putting  strong  rather  than 
weak  constraints  on  the  length  of  runways.  Initial 
work  involves  using  ACRONYM  for  geometry, 
combined  with  a  system  for  reasoning  by  analogy  by 
Winston,  and  a  natural  language  input  system  by  Katz. 
Whether  it  will  be  feasible  to  integrate  these  systems 
remains  to  be  seen. 

Segmentation 

[Baker  82]  describes  a  stereo  system  which  is  primarily 
edge-based,  but  uses  image  intensities  for  interpolation 
of  surfaces  between  edges.  The  system  uses  constraints 
from  [Arnold  80]  along  epipolar  lines  in  a  Viterbi 
search  procedure  to  find  best  correspondence  on  a 
line-by-line  basis.  Combinatorics  of  the  solution  are 
reduced  by  using  a  coarse-fine  search  procedure,  as 
introduced by [Moravec 77]. Continuity of edges
between epipolar lines provides a strong constraint on
solutions, as demonstrated by [Arnold 77].
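
The line-by-line search can be illustrated with a small dynamic
program over one pair of epipolar lines. This is only a sketch under
stated assumptions: the cost (absolute difference of edge contrast,
plus a fixed penalty for an unmatched edge) and the names
match_scanline and skip_cost are hypothetical, and are not the
constraints of [Arnold 80] or the procedure of [Baker 82]. It shows
only how an order-preserving best correspondence can be found by a
Viterbi-style search.

```python
def match_scanline(left, right, skip_cost=1.0):
    """Minimum-cost, order-preserving edge matching on one epipolar line pair.

    left, right: lists of (position, contrast) tuples sorted by position.
    skip_cost:   hypothetical penalty for leaving an edge unmatched.
    Returns (total_cost, pairs), where pairs holds (left_index, right_index).
    """
    n, m = len(left), len(right)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Match edge i-1 on the left with edge j-1 on the right.
                c = cost[i - 1][j - 1] + abs(left[i - 1][1] - right[j - 1][1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, "match"
            if i > 0 and cost[i - 1][j] + skip_cost < cost[i][j]:
                cost[i][j], back[i][j] = cost[i - 1][j] + skip_cost, "skip_left"
            if j > 0 and cost[i][j - 1] + skip_cost < cost[i][j]:
                cost[i][j], back[i][j] = cost[i][j - 1] + skip_cost, "skip_right"
    # Trace the best path back from the end of both lines.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_left":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

Disparities can then be read off as left[i][0] - right[j][0] for each
matched pair (i, j). A coarse-fine procedure would run such a search
first on a few strong edges and use the result to restrict the search
at finer resolution.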

In  support  of  research  on  stereo  vision,  a  survey  of 
stereo  vision  systems  and  supporting  vision  technology 
was  conducted.  Of  course,  vision  mechanisms  relevant 
to stereo include nearly all of vision. Thus, the scope of
the study was very broad [Binford 82c]. The survey
took the form of a set of topics, each with a critical
overview summarizing the state of the art and the
principal ideas of general value. Each
topic included a set of critical reviews of major papers.
Over  two  hundred  papers  were  reviewed.  A  large 
bibliography  was  assembled. 


We are working to build an advanced stereo system
integrating  constraints  at  all  levels  in  a  rule-based 
system. 

ACRONYM interprets images at the level of ribbons,
matching predicted ribbons with those observed. An
improved ribbon finder is under construction
[Marimont 82]. The ribbon finder has several
components: edge finding based on lateral inhibition,
curve linking, curve segmentation, and grouping. The
edge finding module uses lateral inhibition to remove
the effects of smooth shading. The module detects edges
as above-threshold gradients in the signal after lateral
inhibition. Edges are localized by the maximum of the
gradient, equivalent to the zero crossing of the second
derivative of the signal after lateral inhibition [Binford
81]. Edges are linked in four-neighbor connectedness in
a  single  raster  scan  through  an  image.  Curve 
segmentation  is  performed  at  extrema  of  curvature  of 
linked  curves.  The  representation  of  curves  can  be 
regarded  as  splines  for  which  knots  are  well-chosen, 
i.e.  knots  are  chosen  at  discontinuities.  Splines  may 
have  discontinuities  in  position  and/or  tangent  at 
corners.  Other  knots  are  added  as  necessary;  at  these 
knots,  splines  are  continuous  in  position  and  tangent. 
Cubic  splines  are  used  as  a  basis.  The  forming  of  ribbons 
is  not  yet  complete.  It  is  based  on  a  few  fundamental 
operations  including  continuity  of  curves  and 
translational  invariance  from  [Nevatia  74]. 

We  are  conducting  research  to  build  a  fundamental 
underpinning  for  mechanisms  of  ribbon  finding  and 
related  grouping  operations.  Now  ACRONYM  matches 
single  ribbons,  then  pairs  of  ribbons.  This  process  is 
combinatorial.  Pairing  of  edges  in  ribbons  allows  many 
potential  combinations,  e.g.  lanes  of  a  highway.  We 
seek to cut these combinatorics of matching by
selecting  groups  to  match  which  are  well-chosen.  One 
aspect  of  the  problem  is  determining  what  image 
structures  provide  evidence  of  objects  in  three-space. 
This  is  related  to  the  classical  figure/ground  problem. 
It  is  considered  as  an  extension  of  inference  rules  for 
interpreting images as surfaces [Binford 81]. Ribbons
are an example of area structures; they appear as
projections of generalized cylinders. Junctions or stars
composed of curves or of ribbons are another class of
image structures. Even though the rules with vertices
have wide application and considerable generality, new
results are expected to apply in a much broader class of
cases [Binford 82b]. Another aspect is locating
regularities in images without relating them to surface
interpretations [Lowe 82]. The principle that the world
perceived should be independent of the observer leads
to scale invariance in addition to position and rotation
invariance.  This  implies  that  grouping  operations 
should  be  performed  at  all  positions,  all  orientations  if 
directional,  and  all  scales.  Structures  can  occur  at 
multiple  levels;  in  the  ocean,  for  example,  the  water 
surface  is  described  by  ripples  and  eddies  superimposed 
on  waves  at  one  scale,  described  as  flat  at  a  larger  scale, 
and  circular  at  the  scale  of  the  earth.  The  principle  that 
allowable  computational  complexity  is  limited  implies 
that  only  the  lowest  complexity  grouping  operations 
can be carried out. In particular, complexity linear in
the number of features implies grouping of each
element with a constant number of other elements.


Here  we  intend  grouping  by  proximity  and 
neighborhood  relations.  Diameter-limited  grouping  is 
not  effective  here;  what  is  appropriate  is 
complexity-limited  grouping  up  to  the  limit  of  number 
of  neighbors. 

A  measure  is  defined  of  the  likelihood  of  random 
occurrence  of  constellations,  using  non-parametric 
statistics,  without  a  priori  knowledge  of  distributions. 
An example is described of determining linear features
in dot patterns.

In this series of Image Understanding Workshop Proceedings,
we have stressed the issue of representation. In
particular, we have described the development by Horn and
his colleagues of the reflectance map, the albedo image, and
the Gaussian image, and we have described the work of the
group founded by Marr using the primal sketch, the 2½-D
sketch, and axis-based 3-D models.

In the April 1981 Proceedings, we reviewed work on
computing shape from shading and occluding boundaries,
including a local parallel algorithm for doing this due to
Ikeuchi and Horn [1981]; the detection and perception of
motion by computing the optical flow and directional selectivity
of zero crossings; the interpolation of curves and surfaces;
the real-time convolution of images with a difference-of-Gaussians
(DOG) operator; and progress toward computing
the full primal sketch.

Here we review work on stereo to facilitate the computation
of depth information and visible surface characteristics,
the detection and interpretation of motion, the
interpolation and description of visible surfaces, the description
of two- and three-dimensional shapes, real-time convolution,
and shape from shading.

1.  Stereo 


We have previously described Marr and Poggio's theory of
human stereo and its implementation [Marr
and Poggio 1979; Grimson 1981a, 1981b]. The left and
right images are first convolved with a number of DOG
masks and the zero crossings are then matched using a
coarse-to-fine strategy. We have made a number of modifications
to the published algorithm, and are testing the
current version on natural aerial images. The published
algorithm incorporates a continuity constraint that is applied
over horizontal slices of (rectified) images. Individual
horizontal slices are treated separately. Taking note of
the work of Mayhew and Frisby [1981] and Baker and
Binford [1981], we have developed a continuity constraint
that checks for consistency along zero-crossing contours
that typically correspond to a single physical edge. We have
also investigated the sensitivity of the algorithm to vertical
disparity and other image distortions.
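
The first stage described above, convolution with a DOG mask followed
by extraction of zero crossings, can be sketched for a single image
row. The code below is a minimal illustration, not the implementation
under test: the 1.6 ratio of surround to center width, the truncation
radius, the replicated border, and the function names are assumptions
made for the example.

```python
import math

def gaussian(sigma, radius):
    """Sampled Gaussian, normalized to unit sum over [-radius, radius]."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]

def dog_kernel(sigma, ratio=1.6):
    """Difference of a narrow and a wide Gaussian (assumed width ratio)."""
    radius = int(math.ceil(4.0 * sigma * ratio))
    center = gaussian(sigma, radius)
    surround = gaussian(sigma * ratio, radius)
    return [c - s for c, s in zip(center, surround)]

def convolve(signal, kernel):
    """Filter a 1-D signal with a symmetric kernel, replicating the border."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            acc += w * signal[min(max(i + k - radius, 0), last)]
        out.append(acc)
    return out

def zero_crossings(values, eps=1e-12):
    """Indices i where the response changes sign between i and i + 1."""
    def sign(v):
        return 0 if abs(v) < eps else (1 if v > 0 else -1)
    return [i for i in range(len(values) - 1)
            if sign(values[i]) * sign(values[i + 1]) < 0]
```

Running the same row through dog_kernel with a large and then a small
sigma gives the coarse and fine channels; in a coarse-to-fine matcher
the coarse matches would limit the disparity range searched in the
fine channel.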


We have discovered further psychophysical support for
the stereo theory. It was shown by Julesz that random-line
stereograms with tiny breaks (in the vernier acuity range)
can be successfully fused. This has been interpreted as
suggesting that the interpolation process underlying hyperacuity
is parallel with, and preliminary to, stereo matching.
Nishihara and Poggio [1982] have demonstrated that vernier
cues are not needed to perform stereo matching since
zero-crossings contain sufficient information.

The current version of the stereo algorithm, including
the Marr-Hildreth theory of edge detection, is being tested
on a variety of aerial images. Two of these, supplied by
the Defense Mapping Agency and the University of British
Columbia (UBC), are of natural terrain. A third stereo pair,
also supplied by UBC, is of a building complex and features
a deep quadrangle surrounded by high-rise offices. The
fourth pair, supplied by Boeing Corporation, is of a complex
highway intersection. Preliminary results of running
the stereo algorithm on these images are encouraging, and
suggest that the algorithm is robust and does not require
tuning for different classes of stereo pairs.

Recently,  we  have  begun  experimentation  with  a  novel, 
fast  stereo  algorithm  due  to  Nishihara  that  gives  limited 
information.  It  may  have  important  practical  applications. 
Suppose  that  the  two  cameras  are  focused  on  a  plane  (called 
the  focus  plane).  The  matcher  rapidly  determines  whether 
a  zero-crossing  is  in  front  of,  on,  or  behind  the  focus  plane. 
Implemented  in  LISP  machine  microcode,  the  algorithm 
takes  about  2  seconds  to  process  a  1000  pixel  squared  image. 

We have investigated the mathematical relationship between
the Marr-Poggio theory of stereo and Horn's work on
shape from shading. Grimson [forthcoming] has shown that
if the reflectance map [Horn and Sjoberg 1979] is known,
then given a pair of stereo matched depth contours it is
possible to determine the surface normal along the depth
contour. The proof suggests a technique for finding surface
normals that is essentially analogous to photometric
stereo, pioneered by Horn, Woodham, and Silver [1978].
Conversely, it is possible in principle to determine certain
visible surface characteristics from stereo information.




Suppose that the reflectance map is of the form

$$R(\mathbf{n}) = \rho\left[(1 - \alpha)(\mathbf{n} \cdot \mathbf{s}) + \alpha(\mathbf{n} \cdot \mathbf{h})^{k}\right],$$

where $\rho$ is the albedo, $\alpha$ determines the convex combination
of the specular and matte components of the reflectance,
and $k$ is the degree of specularity. Assuming $k$ is known,
it is possible to determine $\rho$ and $\alpha$. It is also possible in
principle to estimate $k$ over an area of the image. With the
separation of human eyes, the technique is most effective
at a distance of about one meter. However, the technique
may find application to wide angle stereo.
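
One way to see why the albedo and the mixing parameter can be
recovered is the following sketch. It assumes, beyond what is stated
above, that s is the unit direction to the light source and h the unit
vector halfway between s and the viewing direction, so that the matte
term is the same in both eyes while the specular term differs. With
the surface normal known from stereo and k known, the two observed
brightnesses are then two linear equations in the products
albedo*(1 - mix) and albedo*mix. All function names are hypothetical;
this is not Grimson's derivation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(a / length for a in v)

def halfway(source, view):
    """Unit vector halfway between the source and viewing directions."""
    return normalize(tuple(a + b for a, b in zip(source, view)))

def reflectance(n, s, h, albedo, mix, k):
    """Reflectance map of the assumed form; n, s, h are unit vectors.

    Assumes the surface faces both the source and the viewer, so that
    both dot products are positive.
    """
    return albedo * ((1.0 - mix) * dot(n, s) + mix * dot(n, h) ** k)

def recover_albedo_and_mix(n, s, h_left, h_right, i_left, i_right, k):
    """Solve the two brightness equations for (albedo, mix)."""
    matte = dot(n, s)
    spec_left = dot(n, h_left) ** k
    spec_right = dot(n, h_right) ** k
    if abs(spec_right - spec_left) < 1e-12 or abs(matte) < 1e-12:
        raise ValueError("the two views do not constrain the parameters")
    b = (i_right - i_left) / (spec_right - spec_left)  # albedo * mix
    a = (i_left - b * spec_left) / matte               # albedo * (1 - mix)
    albedo = a + b
    return albedo, b / albedo
```

The solution is well conditioned only when the specular terms of the
two views differ appreciably, which is consistent with the remark that
the technique depends on the separation of the eyes relative to the
viewing distance.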

2.  The  detection  and  perception  of  motion 

The last Proceedings contains a description of the algorithm
invented by Horn and Schunck [1981] for computing
optical flow. Optical flow is the distribution of
velocities of apparent movement caused by smoothly changing
brightness patterns. The algorithm works well on synthetic
images, especially when there are no depth boundaries
in the scene. It also appears to give reasonable results
when there are depth boundaries, though the errors in the
flow become significant at the boundary. Schunck is continuing
to develop the algorithm, to make it more robust,
less sensitive to noise, and applicable to a wide variety of
natural images.

Recently, Bruss and Horn [1981] (see this volume) have
proposed a method for interpreting the optical flow. The
technique is applicable to automatic passive navigation.
More precisely, a method is proposed for determining the
motion of a body relative to a fixed environment using the
changing image seen by a camera attached to the body.
The optical flow in the image plane is the input, while the
instantaneous rotation and translation of the body are the
output. If optical flow could be determined precisely, it
would only have to be known at a few places to compute the
parameters of the motion. In practice, however, the measured
optical flow is rather inaccurate. It is therefore advantageous
to consider methods which use as much of the available
information as possible. Bruss and Horn employ a least
squares approach which minimizes an appropriate measure
of the discrepancy between the measured flow and that
predicted from the computed motion parameters. Several
different error norms have been investigated. In the general
case, the algorithm leads to a system of nonlinear equations
from which the motion parameters may be computed
numerically. In the special cases of pure translatory or rotational
motion, use of the appropriate norm yields a system
of equations that is solvable in closed form.
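
The purely rotational case gives a feel for why a closed form exists.
In the sketch below the flow produced by a rotation (A, B, C) about
the camera axes is written, for unit focal length and under one common
sign convention (an assumption here), as
u = A*x*y - B*(x*x + 1) + C*y and v = A*(y*y + 1) - B*x*y - C*x.
Because the flow is linear in (A, B, C), minimizing the summed squared
discrepancy leads to three linear equations. The function names are
hypothetical and the code is an illustration, not the authors'
program.

```python
def rotation_rows(x, y):
    """Coefficients of (A, B, C) in the u and v flow components at (x, y)."""
    return ((x * y, -(x * x + 1.0), y),
            (y * y + 1.0, -x * y, -x))

def rotational_flow(x, y, omega):
    """Flow (u, v) at image point (x, y) for rotation omega = (A, B, C)."""
    row_u, row_v = rotation_rows(x, y)
    return (sum(r * w for r, w in zip(row_u, omega)),
            sum(r * w for r, w in zip(row_v, omega)))

def solve3(m, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [list(row) + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [p - factor * q for p, q in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def fit_rotation(samples):
    """Least-squares rotation from samples of (x, y, u, v)."""
    m = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for x, y, u, v in samples:
        for row, observed in zip(rotation_rows(x, y), (u, v)):
            for i in range(3):
                b[i] += row[i] * observed
                for j in range(3):
                    m[i][j] += row[i] * row[j]
    return solve3(m, b)
```

Every measured flow vector contributes to the normal equations, which
is the sense in which the method uses as much of the available
information as possible.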

In other work on motion detection, Hildreth and Ullman
[forthcoming] have developed a technique that combines
features of Horn and Schunck's work on optical flow with
Marr and Ullman's [1981] work on directional selectivity,
in which motion is detected from the temporal changes in
zero-crossings. Marr and Ullman suggested that the initial
computation of motion takes place at the location of features
in an image, in particular at zero-crossings. Due to
the local nature of the measurement of motion along zero-crossing
contours, the motion is only determined perpendicular
to the zero-crossing contour. A subsequent processing
stage is required to integrate the local measurements
in order to compute the component tangential to the zero-crossing
contour.
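
The limitation to the perpendicular component can be made concrete
with a short sketch. It assumes that the measured signal E is carried
unchanged by the motion, so that Ex*u + Ey*v + Et = 0 holds locally;
this single equation fixes only the velocity component along the
gradient (Ex, Ey), which is perpendicular to the contour. The function
name is hypothetical.

```python
def normal_flow(ex, ey, et):
    """Velocity component along the spatial gradient (ex, ey).

    ex, ey: spatial derivatives of the signal; et: its time derivative.
    The component along the contour is left undetermined.
    """
    g2 = ex * ex + ey * ey
    if g2 == 0.0:
        raise ValueError("no spatial gradient, so no motion constraint")
    scale = -et / g2
    return (scale * ex, scale * ey)
```

For a pattern with gradient (3, 4) moving at velocity (2, 1), the time
derivative is -(3*2 + 4*1) = -10 and the recovered normal flow is
(1.2, 1.6); the remaining part of the velocity, (0.8, -0.6), lies
along the contour and is invisible to the local measurement.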

Following Horn and Schunck's work on optical flow, the
integration of local motion measurements is based upon a
definition of smoothness or measure of local variation in motion.
Hildreth and Ullman have explored several definitions
of smoothness. It turns out that the optimal measure of local
variation is the integral of the absolute value of dv/ds,
where s denotes arclength and v is the velocity field. It can
be shown that there is a unique velocity field that satisfies
the initial motion measurements and minimizes the measure
of variation along a zero-crossing contour. We are now
proceeding to implement the method.

The spatial resolution of an image is limited by the
sampling density of the photosensitive elements in the sensor
and by noise. Image motion introduces the additional
problem of temporal resolution. The limiting factors are
the frame rate and the integration time of the photosensitive
elements. This is of little consequence for a stationary
scene, but for moving targets, it poses the problem of motion
smear.

The problem of high spatio-temporal resolution can be
overcome partly by using better sensors with larger arrays
and higher frame rates. There are, however, technological
and physical limits to the spatio-temporal resolution
that can be achieved in this manner, since increasing the
spatial and temporal sampling rate reduces the number of
photons per sensor element per cycle. The performance
of a given sensor can be improved by appropriate spatio-temporal
interpolation schemes. Using such interpolation
processes, the human visual system achieves an extremely
high spatio-temporal resolution compared to the sampling
density of the photoreceptors and their integration time.

There are various methods for reconstructing the original
signal at high resolution by interpolating values measured
at widely spaced intervals. The best known approach
to this problem is based on the Shannon sampling theorem
and on its various extensions. Although it is usually formulated
for one-dimensional signals, its extension to two-dimensional
time-varying images is straightforward.

For static images, interpolation of this type can provide
a resolution much higher than the original sampling grid.
Since, in our framework, the position of zero-crossings is
important, Hildreth and Poggio have examined the problem
of interpolating the values of the DOG convolution
in order to obtain precisely the location of zero-crossings.
Analytical arguments, supported by computer experiments,
have shown that the position of a zero-crossing can be interpolated
precisely in terms of very simple interpolation
functions, even in fact by linear interpolation.
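
Linear interpolation of a zero-crossing between two samples of the
convolution output takes only a few lines. The sketch below assumes
unit sample spacing, and the function name is hypothetical.

```python
def interpolate_zero_crossings(values):
    """Sub-sample positions where a sampled signal crosses zero.

    Between samples i and i + 1 of opposite sign, the crossing is placed
    where the straight line through the two samples reaches zero.
    """
    positions = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a == 0.0:
            positions.append(float(i))
        elif a * b < 0.0:
            positions.append(i + a / (a - b))
    if values and values[-1] == 0.0:
        positions.append(float(len(values) - 1))
    return positions
```

For samples of a signal that is locally linear near the crossing, the
estimate is exact up to rounding; for smooth DOG output it is a close
approximation whenever the sample spacing is small relative to the
width of the operator.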

For time-varying images the situation is more complex.
In the classical scheme, interpolation in space and time is
performed independently, since the temporal dependence of
the input is not constrained in any way.

More effective interpolation schemes are feasible if general
constraints about the nature of the visual input are
incorporated directly into the computation. The key observation
here is that the temporal dependence of the visual
input is usually due to the movement of rigid objects, and
such motion usually has a nearly constant velocity for the
short time and distance over which the interpolation process
operates. Under this constant-velocity assumption, Poggio
has shown that the spatio-temporal Fourier spectrum of a
moving image is strongly constrained, leading to a new form
of the sampling theorem.

Interpolation schemes based on the constant velocity
assumption exploit the equivalence of the time and space
variables. From the point of view of filtering, this means that
spatial and temporal interpolation cannot be performed independently.
Interpolation algorithms based on this approach
could achieve high spatio-temporal resolution for
objects in motion, as long as the constant velocity assumption
is not grossly incorrect, despite low spatial and temporal
sampling rates. High positional acuity for the image
features, although desirable for tracking moving targets, is
not the only goal of spatio-temporal interpolation. A filter
that correctly interpolates the sampled image automatically
avoids any defect in the representation of the image since it
reconstructs the original input. In particular, it avoids motion
smear, and it fills in any gaps in space or
time, wherever or whenever the sampled input is missing.

3.  The  interpolation  and  representation  of 
surfaces 

In our last progress report, we described our work on
interpolating curves and surfaces. Interpolation is an important
problem for vision, since several visual processes,
notably stereo and structure from motion, only specify
depth and orientation at a discrete subset of the points
in an image. The points at which they are specified are
typically those where the irradiance changes abruptly. In
the Marr-Hildreth theory of edge detection, such points
coincide with the zero-crossings of a DOG operator applied
to the image. Human perception, however, is of
complete, piecewise smooth surfaces, and such complete
surface information is important for most applications of
vision. Since mathematically the class of surfaces which
could pass through the known boundary points provided
by stereo, for example, is infinite and contains widely varying
surfaces, the visual system must incorporate some additional
constraints in order to compute the complete surface.

Using the image irradiance equation formulated by
Horn [1978], Grimson [1982] has derived a surface consistency
constraint, informally known as "no news is good
news". The constraint implies that the surface must agree
with the information from stereo or motion correspondence,
and not vary radically between these points. An explicit
form of the surface consistency constraint has been derived
by relating the probability of a zero-crossing in a region of
the image to the variation in the local orientation
of the surface, provided that the surface albedo and the
illumination are roughly constant.

A second idea is that, in the absence of contrary evidence,
the visual system constructs the most conservative
curve or surface consistent with the given sparse data. This
is made precise using the calculus of variations. A crucial
aspect of the variational formulation is the choice of performance
index to minimize. Grimson [1981b] argued that the
performance index should be a seminorm, and suggested
the quadratic variation $f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2$. Brady and Horn
[1981] (see also this volume) have noted that any quadratic
form in $f_{xx}$, $f_{xy}$, and $f_{yy}$ is a seminorm, and so is a plausible
performance index. They have shown that the quadratic
forms that are, in addition, rotationally invariant form a
vector space, which has the square Laplacian and the quadratic
variation as a basis. Since the quadratic variation has
the smaller null space, it offers the tighter constraint, and is
to be preferred. Brady and Grimson [1981] have suggested
that surface perception is the basis for the perception
of subjective contours.
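
The difference in null spaces can be checked numerically. The sketch
below evaluates the two integrands at a point using central
differences: the quadratic variation fxx^2 + 2*fxy^2 + fyy^2 and the
square Laplacian (fxx + fyy)^2. The step size and the function names
are assumptions made for the example. The surface f = x*y is harmonic,
so the square Laplacian does not penalize it at all, whereas the
quadratic variation does; only planes cost nothing under both.

```python
def second_derivatives(f, x, y, h=1e-2):
    """Central-difference estimates of fxx, fxy, fyy at (x, y)."""
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4.0 * h * h)
    return fxx, fxy, fyy

def quadratic_variation(f, x, y):
    """Integrand fxx^2 + 2 fxy^2 + fyy^2 at (x, y)."""
    fxx, fxy, fyy = second_derivatives(f, x, y)
    return fxx * fxx + 2.0 * fxy * fxy + fyy * fyy

def square_laplacian(f, x, y):
    """Integrand (fxx + fyy)^2 at (x, y)."""
    fxx, _, fyy = second_derivatives(f, x, y)
    return (fxx + fyy) ** 2
```

Evaluated on f(x, y) = x*y, quadratic_variation is close to 2 while
square_laplacian is close to 0, which illustrates why the quadratic
variation is the tighter constraint.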

Brady and Horn [1981] suggest that surface interpolation
can be posed in terms of a physical model, namely
as the variational problem describing the constrained equilibrium
state of a thin flexible plate. The variational
problem and the physical model have been developed by
Terzopoulos [1982]. After formulating surface interpolation
as an energy minimizing problem over an appropriate
Sobolev space, the problem is discretized and approached
via the finite element method. In essence, the variational
problem is transformed into a large set of linear algebraic
equations whose solution is computable by local-support,
cooperative, parallel processors.

It has been suggested that visual processes such as edge
detection and stereo provide information at a number of distinct
scales, spanning a range of resolutions. To exploit the
information available at each level of resolution, a hierarchy
of discrete problems is formulated and a highly efficient
multi-level algorithm, involving both intra-level relaxation
processes and bi-directional, inter-level, local interpolation
processes, is applied simultaneously to discover the solution.
Intra-level relaxation smooths out high frequency
variations, while inter-level interpolation tends to damp
out low frequency variations, greatly speeding the overall
process. The resulting process is extremely efficient, even
on a serial computer, though it is better suited to an array
of parallel processors. Current work is concentrating
on the isolation of surface discontinuities and the integration
of the results of several visual processes such as shape
from shading, structure from motion, and stereo, toward a
deeper understanding of the structure of the 2½-D sketch.


Once complete surfaces have been interpolated from
sparse data, it is necessary to describe them, for example
to facilitate recognition or inspection. Horn [1982] has suggested
the Gaussian image as a representation of surface
shape. Local surface normals are brought together at an
origin. Parallel normals are represented by a vector in their
common direction. The magnitude of the vector is proportional
to the number of normals sharing that orientation.
A theorem of Minkowski shows that the representation is
faithful for convex objects, in the sense that the original
surface can be recovered from the Gaussian image. The
scheme is currently being implemented.
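
For a surface given as planar facets, the construction can be sketched
as an orientation histogram. The example assumes that each facet
contributes in proportion to its area, which for a uniformly sampled
surface corresponds to the number of normals sharing an orientation;
the rounding used to merge parallel normals and the function name are
also assumptions.

```python
import math
from collections import defaultdict

def gaussian_image(facets, digits=6):
    """Accumulate facet areas by unit normal direction.

    facets: iterable of (normal, area), where normal is a 3-vector that
    need not be normalized. Returns a dict from unit normal to total area.
    """
    image = defaultdict(float)
    for normal, area in facets:
        length = math.sqrt(sum(c * c for c in normal))
        key = tuple(round(c / length, digits) for c in normal)
        image[key] += area
    return dict(image)
```

For a 2 x 3 x 4 box whose top face is split into two facets, the two
top facets merge into one vector of magnitude 6, and the six vectors,
weighted by magnitude, sum to zero, as they must for any closed
surface.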

In other work on the same problem, Brady [1982c] has
proposed a representation of visible surfaces based on curvature
patches, which are surface patches similar to those
used in computer-aided design (CAD), but which differ in
two respects. First, the webbing, or surface parameterisation,
is required to consist of (a suitable tesselation of) lines
of curvature on the surface. Second, the blending function
is based on the approach to surface interpolation developed
by Grimson, Horn, Brady, and Terzopoulos. It is shown
that the representation is complete, and that it has a number
of advantages over conventional CAD representations,
such as bicubic splines or Bézier surfaces. Surface intersections
are represented in a way that generalizes techniques
associated with line drawing analysis, and is related to the
work of Binford [1981].

4.  Shape  description 

The description of two- and three-dimensional shape is
crucial for recognition. Brady [1982a, 1982b] has developed
a representation of two-dimensional shapes that combines
certain features of two-dimensional projections of generalized
cylinders [Nevatia and Binford 1977, Brooks 1981]
and the symmetric axis transform (SAT) [Blum and Nagel
1978]. The representation has four components. First, local
symmetry is defined in a way that differs from that implicit
in the SAT. Second, axes that are smooth loci of local symmetries
are computed. Third, axes whose region of support
is wholly subsumed by some other axis are deleted. The
resulting smoothed local symmetries are given a parametric
description called a frame. Finally, consistent frames from
adjacent pieces of shape are propagated to form an overall
shape description.

A pilot implementation of smoothed local symmetries
has been constructed. It embodies an efficient algorithm,
based on the mean value theorem, for determining the
points at which a line entering the shape at a given orientation
to the tangent emerges from the shape. The representation
has been applied to determine where to choose grasp
points on a lamina for a two-fingered robot hand. Currently
the representation is being implemented for smoothly curved
shapes and for shapes with straight edges.


The implemented system extracts the bounding contours
from images using DOG filters. Recently, however, an
improved line finder has been developed by Canny [forthcoming].
Canny has shown that the first derivative of
a Gaussian is the filter that optimizes the product of a
measure of the detectability of edges and a measure of
their localization. The operator is directional and operates
at a number of scales. An operator that finds lines by non-maximal
suppression has also been developed.
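
A one-dimensional sketch shows how the two ideas fit together: the
signal is filtered with the first derivative of a Gaussian, and an
edge is marked only where the magnitude of the response exceeds a
threshold and is a local maximum. The kernel truncation, the
replicated border, the threshold value, and the function names are
assumptions for illustration; this is not Canny's operator, which is
directional and works on images at several scales.

```python
import math

def derivative_of_gaussian(sigma):
    """Sampled first derivative of an (unnormalized) Gaussian."""
    radius = int(math.ceil(4 * sigma))
    return [-x / sigma ** 2 * math.exp(-x * x / (2 * sigma ** 2))
            for x in range(-radius, radius + 1)]

def correlate(signal, kernel):
    """Slide the kernel over the signal, replicating the border samples."""
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [sum(w * signal[min(max(i + k - radius, 0), last)]
                for k, w in enumerate(kernel))
            for i in range(len(signal))]

def find_edges(signal, sigma=2.0, threshold=0.5):
    """Indices where the filter response magnitude is a local maximum."""
    strength = [abs(r) for r in correlate(signal, derivative_of_gaussian(sigma))]
    # Non-maximal suppression: keep a sample only if no neighbor is stronger.
    return [i for i in range(1, len(strength) - 1)
            if strength[i] > threshold
            and strength[i] >= strength[i - 1]
            and strength[i] > strength[i + 1]]
```

Samples near the edge also respond strongly, but non-maximal
suppression keeps only the peak, so a blurred step centered on one
sample yields a single edge at that sample.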

5.  Reflectance  techniques 

Reflectance techniques can be applied to recover descriptions
of ground cover from remotely sensed images. Working
with Horn [Horn and Sjoberg 1980], Sjoberg [1982] has
shown that certain strong, though reasonable, assumptions
permit the use of a simple parametric image forming
equation appropriate to satellite sensing. Topographic effects
can be determined with the aid of a digital terrain model
for the area imaged. Atmospheric effects can be estimated,
partly from the image itself and partly by tuning the model
parameters and subjectively evaluating the resulting
synthetic albedo images. The subjective criteria are that
there should be no shading artifacts due to the topography,
that there should be a close match between the sunlit and
shadowed areas straddling a cast shadow boundary, and that
the dynamic range of the computed albedo should be limited.
This kind of tuning is necessary if, as is often the case,
no additional information about the scene or the atmosphere
can be obtained.
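
The form of the image forming equation is not given here. Purely as an illustration, the sketch below assumes a Lambertian surface lit by a direct solar term and a diffuse sky term, plus an additive atmospheric path term: radiance = albedo * (direct * cos(i) + sky) + path. The equation, the parameter names, and the numbers are all assumptions; the point is only to show how the shadow-boundary criterion can expose a badly tuned atmospheric parameter.

```python
def radiance(albedo, cos_incidence, in_cast_shadow, direct, sky, path):
    """Assumed forward model: Lambertian surface plus additive path radiance."""
    cos_i = 0.0 if in_cast_shadow else max(cos_incidence, 0.0)
    return albedo * (direct * cos_i + sky) + path

def recover_albedo(measured, cos_incidence, in_cast_shadow, direct, sky, path):
    """Invert the assumed model for albedo, given trial model parameters."""
    cos_i = 0.0 if in_cast_shadow else max(cos_incidence, 0.0)
    return (measured - path) / (direct * cos_i + sky)

# Two neighbouring pixels of the same ground cover (hypothetical albedo 0.3),
# one sunlit and one just inside a cast shadow.
true = dict(direct=1.0, sky=0.2, path=0.05)
lit = radiance(0.3, 0.8, False, **true)
dark = radiance(0.3, 0.8, True, **true)

# With well tuned parameters the two recovered albedos agree.
good_lit = recover_albedo(lit, 0.8, False, **true)
good_dark = recover_albedo(dark, 0.8, True, **true)

# With the path term ignored they disagree across the shadow boundary.
bad = dict(direct=1.0, sky=0.2, path=0.0)
bad_lit = recover_albedo(lit, 0.8, False, **bad)
bad_dark = recover_albedo(dark, 0.8, True, **bad)
```

In this toy setting the cosine of the incidence angle would come from the digital terrain model and the sun position.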

6. Learning physical descriptions from functional
definitions, examples, and precedents

Working with Binford and Lowry of Stanford University,
Winston and Katz have developed a theory of learning
that explains how physical descriptions for recognition
can be generated using functional definitions, particular
examples, and precedent knowledge. The work synthesises
two sets of ideas: ideas about learning from precedents and
exercises developed by Winston at MIT, and ideas about
physical description developed in the ACRONYM system
at Stanford.

Our principal objective in this research
program is to obtain solutions to fundamental
problems in computer vision that have broad
military relevance, particularly in the areas of
cartography and photo interpretation. Current
research is directed towards developing automated
high-performance techniques for stereocompilation,
delineation of linear features, scene partitioning
(and material identification), and image matching
(and image to database correspondence).

In  addition  to  our  own  research,  we  have 
designed  and  are  implementing  an  integrated 
testbed  system  that  incorporates  the  results  of 
research  produced  throughout  the  image 
understanding  community.  This  system  will  provide 
a  coherent  demonstration  and  evaluation  of 
accomplishments  of  DARPA's  image  understanding 
program,  thereby  facilitating  transfer  of  this 
technology to appropriate military organizations.

I  INTRODUCTION 

Research at SRI International under the DARPA
Image Understanding Program was initiated to
investigate ways in which diverse sources of
knowledge might be brought to bear on the problem
of analyzing and interpreting aerial images. An
initial exploratory phase of research identified
various means for exploiting stored knowledge in
the processing of aerial photographs for such
military applications as cartography,
intelligence, weapon guidance, and targeting. A
key concept is the use of a generalized digital
map to guide the process of image analysis. The
results of this earlier work were integrated into
an interactive computer system called "Hawkeye"
[1]. This system provides necessary basic
facilities for a wide range of tasks in
cartography and photo interpretation.

Research subsequently focused on development
of a program capable of expert performance in a
specific task domain: road monitoring. The
primary objective of this work has been to build a
computer system, called the Road Expert, that
"understands" the nature of roads and road events.
It is capable of performing such tasks as

* Finding roads in aerial imagery.

* Distinguishing vehicles on roads from
shadows, signposts, road markings, etc.

* Comparing multiple images recorded at
different times with symbolic information
pertaining to the same road segment, and
deciding whether significant changes have
occurred.

The general approach and technical details of
the Road Expert's components are contained in
References [2-8]. We have integrated these
separate components into a coherent system that
facilitates testing and evaluation, and have
transferred this system to the DARPA/DMA Testbed.

A parallel research program (described in
Reference [7]), jointly supported by DARPA and NSF,
has complemented the above investigations by
focusing on fundamental computational principles
that underlie the early stages of visual processing
in both man and machine.

At present we are involved in two major
efforts. The first is in support of a joint
DARPA/DMA program to provide a framework for
demonstrating and evaluating the applicability of
image understanding research (from throughout the
entire IU community) to military problems in
general, and to the problems of automated
cartography in particular. Our plans and progress
in this effort (DARPA/DMA Testbed) are described in
a separate paper in these proceedings (Reference
[9]).

The goal of our second major effort is to
carry out a broad program of machine vision
research, specifically in the areas of
three-dimensional terrain understanding, linear-feature
analysis, image partitioning, and image description
and matching. This research program has been
centered on the concept that image interpretation,
except in the simplest situations, involves a form
of reasoning ("perceptual reasoning") characterized
by the need to integrate information from multiple
sources that are typically incommensurate and often
erroneous or in conflict.

We have developed a number of new techniques,
and even complete paradigms, for effecting the
knowledge integration task. These new techniques
have been incorporated in the more focused efforts,
discussed below, which address significant problems
in scene analysis.

* The research described in this paper is based on
work performed under Advanced Research Projects
Agency Contract No. MDA903-79-C-0588.

II RESEARCH PROGRESS AND ACCOMPLISHMENTS

A. Three-Dimensional Compilation and Interpretation

The problem of stereo reconstruction is almost
synonymous with the problem of machine vision: use
of imaged data to (geometrically) model a sensed
scene. We are constructing an integrated system
for stereo compilation to satisfy two goals: first,
to provide a framework for the development and
evaluation of a number of our own efforts, as well
as those of the IU community, in scene modeling;
second, to advance the state of the art of a
critical task domain in cartography.

A key concept in our approach is the use of
global physical and semantic constraints (e.g., the
sun's location, vanishing points, edge detection
and classification, skyline delineation, etc.) to
provide a means of resolving local ambiguities that
frustrate conventional stereo matching techniques
in mapping cultural or urban scenes. Such scenes
contain featureless areas and large numbers of
occlusion edges or, alternatively, are represented
by widely separated or oblique views. We are
taking special measures to make our system modular
so that critical components, embodying work done at
other IU centers and present in the testbed, can be
freely substituted for existing modules.
References [10-14] present a more complete discussion
of our work in this research area.

B. Detection, Delineation, and Interpretation of
Linear Features in Aerial Imagery

We have developed a system, called the "Road
Expert," that can precisely delineate roads in both
high- and low-resolution aerial imagery, and can
classify the visible objects that fall within the
road boundaries [2-8]. A demonstration version of
the Road Expert has been installed on the DARPA/DMA
testbed. We have investigated extensions of this
work to the problem of delineating more general
types of linear structures and are presently
developing techniques that will enable the system
to adjust its own parameters to optimize
performance over a variety of viewing conditions
and terrain types without the need for operator
intervention. Our ultimate goal is a
high-performance, completely autonomous system for
linear delineation.

C. Image Matching and Image-to-Database
Correspondence

We have developed a new paradigm, called
Random Sample Consensus (RANSAC), for fitting a
model to data containing a significant percentage
of gross errors, and have applied this paradigm to
the solution of the matching/correspondence problem
[15]. A RANSAC-based camera model solver has been
developed and installed on the testbed. We expect
that RANSAC will be equally applicable to a wide
range of other model-based interpretation tasks,
and are presently investigating its use in
techniques for recognizing and labeling known two-
and three-dimensional scene features, even when
seen under unusual viewing or illumination
conditions, and even when the objects are partially
occluded.
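
The flavor of the paradigm can be conveyed with a much simpler model than a camera. The sketch below, which is an illustration and not the testbed's camera model solver, fits a straight line to points contaminated by gross errors: it repeatedly fits a line to a random minimal sample of two points and keeps the line supported by the largest consensus set. The function names, tolerance, trial count, and seed are assumptions made for the example.

```python
import random

def fit_line(p, q):
    """Line a*x + b*y + c = 0 through two points, with (a, b) of unit length."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0:
        return None  # coincident points define no line
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)

def ransac_line(points, tolerance=0.5, trials=200, seed=0):
    """Return the line with the largest consensus set, and that set."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(trials):
        p, q = rng.sample(points, 2)          # minimal random sample
        line = fit_line(p, q)
        if line is None:
            continue
        a, b, c = line
        # Consensus set: points within the error tolerance of the line.
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    return best_line, best_inliers
```

A fuller version would refit the model to the final consensus set, for example by least squares, once the gross errors have been excluded.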

D. Image Partitioning, Intensity Modeling, and
Material Identification

Our goal in this effort is to develop
techniques for partitioning and modeling the
material composition of a scene from available
imagery. In order to recover information about
actual surface reflectances and physical
composition, the problem of intensity modeling must
be addressed. We have devised methods for deriving
absolute scene-intensity information in the absence
of calibration data (such as a step wedge exposed
on the image), based on knowing the identity of the
material composition of the surfaces at a few
locations in the image, a necessary capability
for partitioning the image into labeled regions of
a given material type (see Reference [11]).

ABSTRACT

This paper reviews the current status of an
ongoing program to demonstrate the application of
DARPA Image Understanding research to a photo
interpretation system using real imagery. The
program is based on the ACRONYM vision system,
developed by Rod Brooks and Tom Binford at Stanford
University on the DARPA Image Understanding
Project. This system was chosen for its
sophistication and potential for extension.
ACRONYM is running in FRANZ LISP on a VAX with the
DARPA IU Testbed environment. Current work on the
project involves extending ACRONYM to meet the
specific application requirements. These
extensions include improving the robustness of the
low level description module, broadening the class
of information generated by the prediction system
(including the ability to predict shadows), and
adding a script driven situation assessment module.
This latter module will provide the capability to
understand a series of interpretation results.

1.0 INTRODUCTION

Research in Image Understanding (IU) has been
conducted for the last 20 years at several
university and industrial laboratories. In recent
years, the major focus of this research has been
the DARPA Image Understanding Project. Research on
both vision system components and complete vision
systems has been reported in the proceedings of the
Image Understanding Workshops. The history of
these developments is well known in the IU
community. An IU testbed [7], [8] has been
established to facilitate use of this research in
applications. Hughes has been performing on a
program, supported by DARPA and ONR, to apply
research results from the DARPA Image Understanding
Project. The objective of this program is to
automate parts of an operational
photointerpretation task. The approach is to
construct a system from selected components and
demonstrate its capabilities on real imagery using
a general purpose computer.

The system we are constructing will search
digitized images for instances of interesting
objects. The system is both top-down
(prediction-driven) and bottom-up (data-driven).
Both camera and illumination models are included.
Image and shadow shape predictions are being
generated symbolically from 3-D object models. A
sophisticated symbolic matching process is being
used to interpret line finder results. Objects
observed in a sequence of images are related to
expected event sequences via a situation assessment
module.

This paper describes the progress to date and
status of the Hughes IU project. Section 2
summarizes progress, including the IU component
selection, identification of necessary extensions,
and work to carry out these extensions. Section 3
describes work in progress. Section 4 gives a
prognosis and plans. The ACRONYM system, which is
the basis of our development effort, is described
briefly in the appendix.

2.0 PROGRESS TO DATE

The IU applications project began with an
extensive evaluation of available IU software in
terms of the application requirements. Both IU
testbed software at SRI and software at other
laboratories were surveyed. Both vision components
and vision systems were evaluated. The study was
documented in a report, "IU Algorithm Report,"
March, 1982. The ACRONYM system [3], [4], [5], [6]
was chosen as the basis for the demonstration.
Acronym's overall flexible implementation provided
an excellent basis for the extensions required.

ACRONYM was developed by Rod Brooks and Tom
Binford at Stanford University. Its vision
knowledge is expressed in rules. Its functional
capabilities include an interpretation system
utilizing a 3-D model hierarchy and a powerful
constraint system to guide the matching process.
It was implemented in MACLISP, ran on a PDP-10, and
was demonstrated in the domain of interpretation of
aerial photographs of wide-bodied jet aircraft.
The modular implementation is based on a core of
record structures and control constructs. Other
features of ACRONYM are summarized in the appendix.

The  application  objectives  require  a  system  with 
the  functions  indicated  in  Fig.  2.  Following  the 
image/model matching process, observed objects are
related  to  symbolic  history  files  in  order  to  track 
a  sequence  of  activities.  This  step,  labeled 
Situation  Assessment  in  the  figure,  is  an  addition 
to  ACRONYM,  which  provides  the  image/model  matching 
activities. 

Analysis  of  the  600  pages  of  ACRONYM  source 
code  showed  that  several  extensions  to  the  ACRONYM 
system  are  needed  for  the  desired  application. 
These  extensions  fall  into  five  categories:  (1) 
more  complex  control  structure,  (2)  prediction  of 
shadows,  (3)  improved  extraction  and  matching  of 
line segments, (4) means for relating observations
from  a  sequence  of  images,  and  (5)  further 
application-level  development  tools.  These  five 
extension  areas  are  in  various  stages  of  detailed 
design  and  implementation.  The  status  of  each  area 
is  discussed  in  the  following  section. 

The ACRONYM system is running in FRANZ LISP on
a VAX 11/780 at Hughes. FRANZ LISP itself is
supported  on  the  VAX  by  the  EUNICE  environment, 
provided  by  the  IU  Testbed.  The  computing 
environment  is  summarized  in  Fig.  1. 

3.0  WORK  IN  PROGRESS 

The  remainder  of  this  section  reports  the 
progress  and  status  of  the  five  extension  areas. 

3.1  SEQUENCING  AND  CONTROL 

Coordination  of  system  activities  is 
accomplished  by  a  set  of  sequencing  and  control 
rules.  Their  task  is  to  relate  model-driven 
(top-down)  predictions  and  data-driven  (bottom-up) 
information  derived  from  the  image.  The  topmost 
level  of  the  sequencing  and  control  rule  hierarchy 
is  shown  in  Fig.  3.  First,  the  current  image  is 
interpreted. Next, these results are related to
historical  data.  Finally,  results  from  the  image 
interpretation  and  situation  assessment  steps  are 
reported. 

Sequencing and control of the image
interpretation process consists of two major steps.
First, the rule (SC-SREGISTRATION) matches the
image scene to a geographic model.

When an adequate location match is obtained,
the second image interpretation step identifies
pre-selected regions of interest in image
coordinates. These regions are formed into a list
of windows by rule (SC-WINDOW-SELECT). The system
then iterates through this list one or more times,
seeking to locate and identify object instances. A
lower-level tier of control rules directs the
search within each window and uses the system
object/model hierarchy to identify objects. These
rules issue as many calls to the vision system as
needed, each seeking to match a specific object
model from the model hierarchy to extracted image
features.
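
The two-step flow just described can be summarized in a few lines. The sketch below is only a procedural paraphrase in Python of what the system expresses as control rules; the function name and the callables passed to it (register, select_windows, match_model) are hypothetical stand-ins for the registration rule, the window selection rule, and calls to the vision system.

```python
def interpret_image(image, register, select_windows, match_model, models, passes=1):
    """Register the scene, then search each window for each object model."""
    location = register(image)
    if location is None:            # no adequate location match was obtained
        return []
    found = []
    for _ in range(passes):         # the window list may be traversed repeatedly
        for window in select_windows(image, location):
            for model in models:    # one vision-system call per model
                instance = match_model(image, window, model)
                if instance is not None:
                    found.append((window, model, instance))
    return found
```

In a rule-based implementation the order of these calls would be decided by the rules themselves, so the fixed nesting shown here is a simplification.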


Within  the  vision  system,  separate  rule 
subsets  control  the  prediction,  ribbon-finding,  and 
model  matching  steps.  The  prediction  step 
calculates  observable  surface  shapes  and  their 
relationships, for both objects and their shadows.
Line and ribbon finding, described in Section 3.3,
utilizes heuristic rules for line finder parameter
selection. The model matching process matches
prediction graph nodes and then prediction graph
arcs.

Control within the situation assessment and
report generation steps is sequential. The five
situation assessment steps are script selection,
event prediction, script verification, script
updating, and script inferencing. Report
generation is initiated by the rule
(SC-REPORT-GENERATION).

The design of the high-level structure of Fig.
3 has been completed. In addition, the overall
sequencing and control of the vision system is
complete. This consists of the ACRONYM control
structure  together  with  modifications  for  shadow 
prediction  (see  Section  3.2)  and  a  more 
sophisticated  shape  extraction  design  (see  Section 
3.3).  The  next  step  is  writing  and  checking  out 
the  control  rules. 

3.2  SHADOW  EXPLOITATION 

A shadow prediction capability is being added
to  Acronym  in  order  to  exploit  all  observed  image 
features.  The  approach  taken  is  to  analyze 
Acronym's  prediction  process  and  define  extensions 
needed  for  implementing  the  shadow  understanding 
capability. 

Objects  in  Acronym  are  modeled  as  combinations 
of generalized cones [1], [10]. Each cone gives
rise to a set of planar contours which correspond
to  faces  of  a  cone  or,  in  the  case  of  a  curved 
surface,  its  projection  onto  a  plane.  Each  cone 
has  its  own  coordinate  system. 

Acronym's  prediction  process  has  three  major 
functions.  First,  given  a  specific  object  cone,  a 
camera,  and  their  locations  and  orientations  in  a 
world coordinate system, Acronym produces a
simplified  transformation  of  the  cone  into  the 
camera's  coordinate  system.  The  rotation  expression 
of  this  transformation  is  a  product  of  rotations. 
Acronym  predicts  how  these  individual  rotations 
distort  the  shapes  of  contours  which  are  expected 
to  appear  in  the  image.  Secondly,  Acronym  predicts 
spatial  relationships  between  the  subparts  of  an 
object, i.e. between the spines of the cones
comprising  the  object.  Finally,  Acronym  prediction 
provides  direction  for  the  low-level  descriptive 
process,  i.e.  the  ribbon  finder. 

Acronym's ability to predict the above image
features is mainly limited by the camera's
orientation with respect to the object. With help
from Rodney Brooks, additions and modifications to
the system are being made to allow an arbitrary
camera orientation. Also, Acronym handles
occlusion only in the case where the faces of two
subparts of an object are hard up against each
other. The ability to handle other cases of
occlusion depends heavily on combining spatial
information obtained from predictions of different
subparts of an object and from different objects.

The  Acronym  prediction  system  is  being 
extended  to  include  shadow  prediction.  Given  an 
illumination  direction,  the  system  will  first 
predict  whether  or  not  a  planar  surface  is 
illuminated. (See Fig. 4(a).) In the case of a
curved surface, the location of the illumination
boundary is predicted. (See Fig. 4(b).) In both
cases  the  shadows  cast  by  the  object  onto  a 
background  plane  are  predicted.  Then,  the 
dimensions  of  both  the  shadowed  and  illuminated 
contours  of  the  object  are  predicted,  as  well  as 
the  dimensions  of  the  shadow  contours  on  the 
background  plane.  Next,  the  spatial  relationships 
between  object  subparts  and  their  shadows  are 
predicted  (Fig.  5).  Finally,  distortion  of  shadow 
contours  due  to  individual  rotations  is  predicted 
(Fig.  6). 
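
The first two predictions have a simple numeric counterpart, sketched below. Acronym's predictions are symbolic, so this is only a concrete illustration under stated assumptions: the light is a distant source described by a vector pointing from the scene toward it, the background plane is horizontal at height plane_z, and all function names are invented for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_illuminated(normal, to_light):
    """A planar face is lit when its outward normal faces the light."""
    return dot(normal, to_light) > 0.0

def cast_shadow_point(point, to_light, plane_z=0.0):
    """Slide a point along the light rays onto the plane z = plane_z."""
    lx, ly, lz = to_light
    if lz <= 0.0:
        raise ValueError("light must be above the background plane")
    t = (point[2] - plane_z) / lz       # ray parameter at the plane
    return (point[0] - t * lx, point[1] - t * ly, plane_z)

def cast_shadow(contour, to_light, plane_z=0.0):
    """Shadow contour of a planar contour given as a list of 3-D vertices."""
    return [cast_shadow_point(p, to_light, plane_z) for p in contour]
```

For example, with the light direction (1, 0, 1), a vertex two units above the plane casts its shadow two units away in the direction opposite the light.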

The  extensions  required  for  shadow  prediction 
are  in  a  state  of  detailed  design.  This  capability 
will  be  implemented  over  the  next  several  months. 

3.3  SHAPE  EXTRACTION 

To  a  great  extent,  the  performance  of  a 
practical  machine  vision  system  is  determined  by 
its  ability  to  reliably  extract  recognizable 
features  from  real-world  imagery.  In  order  to 
provide  the  ACRONYM  system  with  this  necessary 
element,  a  new  design  has  been  undertaken  to 
extract  two-dimensional  shapes  from  images. 

Much  research  has  been  directed  toward  the 
problem  of  linear  feature  extraction,  resulting  in 
several  edge  and  line  finding  algorithms  which 
perform  quite  well.  However,  much  less  progress 
has  been  made  in  techniques  for  extracting 
two-dimensional  shapes. 

There has been some recent work by Mulgaonkar,
Shapiro, and Haralick [9] in which
three-dimensional  objects  are  modeled  using  blobs, 
plates,  and  sticks  as  object  primitives.  However, 
the  goals  of  this  work  seem  aimed  toward  the  model 
representation  and  matching  problems,  rather  than 
the  specific  problem  of  shape  extraction.  Their 
extraction  of  two-dimensional  shapes  relies  upon 
simple  point  operators  such  as  thresholding  to 
segment  objects,  and  representation  of  the 
extracted  regions  as  near-convex  polygons. 

In  general,  region  or  blob  feature  extraction 
methods  seem  to  be  inadequate  for  application  with 
the  Acronym  system.  The  power  of  these  methods  is 
evident  in  the  relational  matching  that  can  be 
performed between these features and
three-dimensional models. However, features of this type
are  inherently  weak  in  terms  of  the  accuracy  of  the 
boundaries  of  the  extracted  objects,  a  key 
performance  requirement  for  application  with  the 
Acronym  system. 

One  approach  to  the  two-dimensional  shape 
extraction  problem  was  exemplified  by  Brooks  [2]  in 


his  Prototype  Edge  Mapping  Module,  which  formed  the 
shape  extraction  module  for  the  original  Acronym 
system.  This  approach  begins  by  extracting  linear 
edge  segments  with  one  of  the  existing  line  finding 
systems, such as the Nevatia-Babu linefinder [11].
Using  this  segment  information,  the  shape 
extraction  system  attempts  to  trace  boundaries  of 
closed,  convex  two-dimensional  contours.  Segments 
are  selected  as  boundary  components  based  on 
connectivity  or  near-connectivity,  and  the  tracing 
function  is  able  to  bridge  colinear  gaps  and  join 
nearby  endpoints.  These  selected  segments  are  then 
subjected  to  certain  convexity  tests  to  determine 
that  they  are  tending  toward  formation  of  closed, 
convex  contours.  Finally,  the  traced  contour  is 
subjected  to  several  tests  of  dimensionality  and 
amount  of  enclosed  area  to  determine  if  these 
parameters  fall  within  some  model-driven  range  of 
expectation.  For  each  contour  which  survives  these 
tests,  a  directionality  histogram  is  computed,  and 
this  information  is  in  turn  used  to  describe  the 
extracted  shape  as  either  an  ellipse  or  ribbon. 
These  ellipse  and  ribbon  descriptions  are  directly 
analogous  to  the  shapes  predicted  by  Acronym's 
geometric  reasoning  system. 
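
The details of the directionality histogram are not given here. As a hedged illustration of how such a histogram could separate the two shape classes, the sketch below accumulates boundary length by edge orientation and labels a contour a ribbon when one orientation dominates. The bin count, the 0.6 threshold, and the decision rule are assumptions for the example, not the Prototype Edge Mapping Module's actual test.

```python
import math

def direction_histogram(contour, bins=8):
    """Length-weighted histogram of edge orientations, taken modulo 180 degrees."""
    hist = [0.0] * bins
    n = len(contour)
    for i in range(n):
        (x1, y1), (x2, y2) = contour[i], contour[(i + 1) % n]
        length = math.hypot(x2 - x1, y2 - y1)
        if length == 0:
            continue
        theta = math.atan2(y2 - y1, x2 - x1) % math.pi
        hist[int(theta / math.pi * bins) % bins] += length
    total = sum(hist)                   # assumes a non-degenerate contour
    return [h / total for h in hist]

def classify(contour, bins=8, ribbon_fraction=0.6):
    """Label a closed contour by how concentrated its edge directions are."""
    hist = direction_histogram(contour, bins)
    return "ribbon" if max(hist) >= ribbon_fraction else "ellipse"
```

A histogram this coarse cannot recover detailed dimensions.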

This  method  was  able  to  perform  adequately  in 
some  applications.  However,  the  method  has 
limitations  for  the  desired  application.  One 
drawback  stems  from  the  fact  that  in  many 
applications  on  real-world  imagery,  only  a  few 
boundaries  of  any  object  will  be  visible  and 
detectable  by  the  segmentation  system.  Occlusions 
of  these  bounding  edges  frequently  occur  due  to 
other  objects,  shadows  cast  by  nearby  objects,  and 
perhaps  most  seriously,  self  shadowing  which  is 
bound  to  occur  in  real  imagery.  In  these  cases, 
attempts  to  trace  a  complete,  closed  boundary  seem 
doomed  to  failure.  Another  difficulty  is  that, 
once  the  contour  has  been  traced,  the  approximation 
of  a  ribbon  or  ellipse  based  on  directionality 
histograms  results  in  relatively  coarse  shape 
approximations.  While  this  drawback  is  not  serious 
in  some  applications,  it  does  limit  the  power  of 
the  vision  system  to  identify  objects  based  on 
detailed  dimensional  information.  Finally,  the 
trace  and  test  approach  makes  very  little  use  of 
information  about  the  shapes  which  were  actually 
predicted.  While  some  information  about  the 
expected  range  of  dimension  and  surface  area  is 
imbedded,  it  is  again  relatively  coarse,  resulting 
in  the  extraction  of  a  great  deal  of  clutter. 
Also,  the  shape  extraction  process  makes  no  use  of 
available  information  about  the  expected  spatial 
connectivity  of  various  predicted  object  subparts. 

The  design  of  a  next-generation  shape 
extraction  system  has  been  undertaken,  building  on 
Brooks' implementation. Several considerations
have  driven  the  design  of  this  shape  extraction 
module,  with  the  goal  of  producing  a  system  which 
would  prove  powerful  enough  to  extract  useful 
features  yet  provide  robust  performance  in  a  wide 
range  of  applications. 

The  primary  consideration  lies  in  the  concept 
of  a  top-down,  model  driven  approach  to  shape 
extraction.  Acronym  predicts  constrained  instances 
of certain two-dimensional shapes from
three-dimensional models, given some collateral
information about the camera model. It is quite
reasonable to provide these predicted shapes to the
shape extraction system, and allow that system to
search for specific instances of these predicted
shapes.

By  providing  the  shape  extraction  mechanism 
with  this  detailed  information,  specific  instances 
of  candidate  shapes  can  drive  selective  searches 
for  instances  of  specific  segment-based  features. 
For  example,  if  a  ribbon  is  predicted  with  a 
certain  constrained  range  of  width,  length,  and 
sweep  characteristics,  the  shape  extraction  system 
can  search  for  instances  of  segments  which  lie  on 
parallel  lines  separated  by  a  distance  falling 
within  the  length  range  of  the  ribbon,  and  check 
that  these  segments  fall  within  the  proper  spatial 
separation  along  these  lines  to  potentially 
describe  pieces  of  the  end  segments  of  the  ribbon. 
Having  identified  some  set  of  candidate  segment 
pairs, search can be initiated for segments which
would form parts of the side segments of the
ribbon, with the possible separation and angular
relationship determined by the width and sweep
characteristics of the ribbon. A typical matching
situation is indicated in Fig. 7.
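
The first stage of such a search, finding pairs of roughly parallel segments whose separation falls in a predicted range, might look like the following sketch. It is an assumption-laden illustration, not the design's rule set: segments are endpoint pairs, separation is measured from the midpoint of one segment to the line through the other, and the names and the five degree angular tolerance are invented for the example.

```python
import math

def orientation(seg):
    """Segment direction as an angle modulo 180 degrees."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi

def angle_between(s, t):
    d = abs(orientation(s) - orientation(t))
    return min(d, math.pi - d)

def separation(s, t):
    """Perpendicular distance from the midpoint of t to the line through s."""
    (x1, y1), (x2, y2) = s
    (u1, v1), (u2, v2) = t
    mx, my = (u1 + u2) / 2.0, (v1 + v2) / 2.0
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0:
        return math.inf                 # degenerate segment never matches
    return abs((x2 - x1) * (my - y1) - (y2 - y1) * (mx - x1)) / length

def candidate_pairs(segments, min_sep, max_sep, max_angle=math.radians(5)):
    """Index pairs of near-parallel segments separated by min_sep..max_sep."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            s, t = segments[i], segments[j]
            if (angle_between(s, t) <= max_angle
                    and min_sep <= separation(s, t) <= max_sep):
                pairs.append((i, j))
    return pairs
```

The same test, run with the width range and the sweep angle in place of the length range, would then propose candidates for the side segments.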

Note  that  there  is  no  need  to  find  complete 
sides  of  a  ribbon,  nor  is  it  necessary  to  find 
pieces  of  all  four  boundaries.  Rather,  candidates 
for  matching  may  be  selected  based  only  on 
suggestive  clues  indicating  the  existence  of  an 
instance  of  a  shape.  Created  along  with  this 
extracted  shape  would  be  some  measure  of  the 
quality  of  this  particular  shape,  in  terms  of  some 
comparison  of  the  perimeter  of  the  predicted  shape 
to  the  amount  of  boundary  found,  the  number  of 
sides  where  boundaries  were  supported,  and  the 
consistency  of  the  spatial  relations  of  the 
segments  found. 
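
One way such a quality measure could be formed is sketched below. The three ingredients come from the description above; the way they are normalized and the weights used to combine them are assumptions made purely for illustration, as are the function and parameter names.

```python
def shape_quality(predicted_sides, found_lengths, consistency,
                  weights=(0.5, 0.3, 0.2)):
    """Combine boundary coverage, side support, and spatial consistency.

    predicted_sides: predicted length of each side of the shape
    found_lengths:   total segment length found along each side
    consistency:     score in [0, 1] for the spatial relations of the segments
    """
    perimeter = sum(predicted_sides)
    covered = sum(min(found, side)
                  for found, side in zip(found_lengths, predicted_sides))
    coverage = covered / perimeter
    supported = sum(1 for found in found_lengths if found > 0) / len(predicted_sides)
    w_cov, w_sides, w_cons = weights    # hypothetical weights
    return w_cov * coverage + w_sides * supported + w_cons * consistency
```

With these weights a ribbon supported on only two of its four sides still receives a usable score, which is the behavior wanted when parts of the boundary are occluded.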

Since  there  is  no  requirement  to  completely 
trace  the  boundaries  of  a  contour,  cases  of  partial 
occlusion by objects or shadows will not render the
vision  system  incapable  of  matching  these  shapes. 
Therefore,  the  system  should  perform  in  a  much  more 
robust  manner. 

In  addition  to  the  model-driven  approach  to 
shape  extraction,  the  new  design  also  has  knowledge 
of  the  spatial  connectivity  of  various  predicted 
shapes,  especially  shapes  which  represent  subparts 
of  a  single  modelled  object.  With  this 
information,  confidence  in  extracted  shapes  can  be 
influenced  by  their  connectivity  to  other  shapes. 
For  instance,  there  might  be  very  low  confidence  in 
two  shapes  extracted  from  a  small  number  of 
segments.  However,  if  both  of  these  shapes  fall  in 
the  correct  spatial  pattern  to  describe  the 
subparts  they  represent,  there  is  much  more 
confidence  in  the  results. 

Further,  this  knowledge  of  connectivity  can  be 
used  to  handle  cases  of  occlusion  of  one  subpart  by 
a  connected  subpart.  This  type  of  reasoning  is  of 
particular  advantage  in  situations  where  shapes 
projected  from  two  subparts  of  an  object  overlap  in 
such  a  way  as  to  be  seen  in  the  imagery  as  a 


single,  larger  shape  which  can  be  identified  by  the 
extraction  process. 

The  overall  design  of  the  new  shape  extraction 
module  is  shown  in  Fig.  8.  The  system  is  invoked 
from  the  executive  rules  which  control  the  overall 
operation  of  the  Acronym  vision  system.  The 
initial  rules  select  from  an  available  set  of 
shape-extraction  heuristic  rule  sets,  each  of  which 
may  be  expert  in  extracting  certain  types  of 
features  (ribbons,  ellipses,  linear  features,  etc.) 
or  may  specialize  in  shape  extraction  in  certain 
imaging conditions (such as clear vs. hazy imagery,
snowy  conditions,  etc.).  Any  one  or  any 
combination  of  these  expert  rule  sets  may  be  chosen 
by  the  selection  rule  set.  The  decision  is  guided 
by  the  shapes  themselves,  and  also  by  information 
provided  from  outside  the  vision  system  (by  the 
controlling  system  which  calls  the  vision  system  as 
a  sub-process),  providing  data  on  the  imaging 
conditions. 

As  each  heuristic  rule  set  executes,  the  rules 
will  need  to  gather  specific  types  of  edge-based 
features.  The  rules  will  be  capable  of 
parameterizing  and  invoking  a  line-finding  process 
(such  as  the  Nevatia-Babu  system)  or  some  other 
low-level  processing  or  enhancement  algorithms. 
The  resulting  edge  features  are  then  stored  on  a 
spatially indexed data base, along with more
complex  features  derived  from  this  edge  information 
by  the  spatial  indexing  rules.  These  more  complex 
features  include  parallel  lines,  line  pair 
vertices,  and  colinear  lines. 

Having invoked the rules to extract edge data
and  create  a  library  of  features,  the  shape 
extraction  rules  access  this  data  through  requests 
to  a  set  of  rules  which  access  the  spatial  feature 
data  base.  These  rules  in  turn  are  influenced  by 
spatial  search  restrictions  which  are  imposed  by 
the  calling  routine.  These  search  restrictions  may 
be  used  to  guide  searches  for  instances  of  objects 
near  some  known  features  or  locations  (such  as 
searching  for  an  airplane  along  a  runway,  rather 
than  on  top  of  a  terminal  building).  The  segment 
selection  rules  search  the  spatial  feature  library 
to  provide  any  instances  of  the  requested  features 
within  the  restricted  search  area. 
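
The restricted search described above can be sketched as a small grid index. This is only an illustration: the class name `SpatialFeatureIndex`, the grid layout, and the feature kinds are assumptions made for the example, not ACRONYM's actual rule-based LISP data structures.

```python
from collections import defaultdict


class SpatialFeatureIndex:
    """Toy grid index over edge-derived features (hypothetical structure)."""

    def __init__(self, cell_size=32):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Map an image coordinate to the grid cell that contains it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def add(self, kind, x, y, data=None):
        """Store a feature of the given kind at its reference point (x, y)."""
        self.cells[self._cell(x, y)].append((kind, x, y, data))

    def query(self, kind, xmin, ymin, xmax, ymax):
        """Return features of `kind` inside the restricted search window."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for feat in self.cells.get((cx, cy), []):
                    k, x, y, _ = feat
                    if k == kind and xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(feat)
        return hits
```

Only the grid cells that overlap the search window are visited, so a request such as "parallel line pairs near the runway" does not have to scan features elsewhere in the image.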

Once a particular shape has finally been identified,
it is appended to a data structure called the
Picture Graph, which contains all shapes to be
submitted as match candidates. Knowledge of the
existence of this particular shape is also appended
to the spatial feature library, to provide cues for
other shapes which may be predicted to be
connected.

In  all,  this  model-driven  approach  to 
two-dimensional  shape  extraction  is  expected  to 
provide  robust  extraction  performance  over  a  wide 
range  of  application  imagery. 

The  preliminary  implementation  of  the  design 
shown  in  Figure  8  is  completed.  For  the  desired 
application  there  will  be  considerable 
application-dependent  knowledge  embedded  in  the 
heuristic  rule  sets  to  control  the  details  of  shape 
extraction. This specific knowledge will be
acquired  by  exercising  the  system  with  application 
imagery. 

3.4  SITUATION  ASSESSMENT 

In  the  desired  application,  besides  modeling 
the  geometry  of  the  objects  in  a  scene,  the  overall 
situation  must  be  modeled.  That  is,  the  actions  of 
the  objects  have  a  predictable  behavior  and  that 
behavior  is  modeled.  More  specifically,  the 
location  and  appearance  of  an  object  in  a  sequence 
of  images  reflects  its  behavior,  and  this  sequence 
of  observations  forms  a  predictable  history. 

Tracking  the  activity  of  objects  can  have  an 
impact  on  the  vision  system  by  confirming  or 
disconfirming  the  classification  derived  by  the 
vision  system.  If  the  vision  system  misclassifies 
an  object,  then  the  situation  assessment  has  a 
chance  to  catch  the  error  by  detecting  that  the 
object  is  not  behaving  in  a  normal  manner.  Also, 
the  ultimate  goal  of  the  system  is  to  keep  track  of 
the  activities  of  objects. 

The  method  of  monitoring  the  situation  is  to 
use  "scripts"  [12].  Scripts  provide  the  capability 
to  predict  future  behavior  and  infer  unobserved 
behavior.  The  scripts  provide  a  knowledge 
structure  to  match  against  observations.  The 
observations  are  matched  to  a  consistent  set  of 
scripts  for  all  objects.  The  script  is  flexible  in 
the  sense  that  not  all  events  of  the  script  have  to 
be matched, so if the object is not observed at one
stage the script will still match.

In  the  desired  application,  there  are  several 
ambiguities  that  make  it  necessary  to  carry  along 
multiple  possible  interpretations  of  the  script. 
The  ambiguities  are  kept  until  only  one 
interpretation  is  consistent.  If  no 
interpretations  are  consistent,  then  an  anomalous 
situation  has  occurred,  and  it  is  reported. 
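
A minimal sketch of this bookkeeping, assuming a script is simply an ordered list of event names. The function names and the example scripts in the usage note are hypothetical and are not taken from the prototype described here.

```python
def advance(script, position, event):
    """Return the position after `event` if it occurs at or beyond
    `position` in the script (unobserved events are skipped);
    return None if the event cannot be matched."""
    for i in range(position, len(script)):
        if script[i] == event:
            return i + 1
    return None


def interpret(scripts, observations):
    """Carry along every script that is consistent with all observations.

    Returns a status string and the sorted names of surviving scripts.
    """
    live = {name: 0 for name in scripts}
    for event in observations:
        survivors = {}
        for name, pos in live.items():
            new_pos = advance(scripts[name], pos, event)
            if new_pos is not None:
                survivors[name] = new_pos
        live = survivors
        if not live:
            # No interpretation is consistent: report an anomaly.
            return "anomaly", []
    names = sorted(live)
    return ("resolved" if len(names) == 1 else "ambiguous"), names
```

With two illustrative scripts, `departure = ["parked", "taxi", "runway", "airborne"]` and `maintenance = ["parked", "taxi", "hangar", "parked"]`, the observations `["parked", "taxi"]` remain ambiguous, `["parked", "runway"]` resolve to `departure` even though the taxi stage was never observed, and `["runway", "hangar"]` match neither script and are reported as anomalous.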

A  prototype  script  interpretation  system  has 
been  designed  and  implemented  on  a  Symbolics  LM-2 
LISP  machine.  Following  evaluation  and  refinement, 
the  script  system  will  be  translated  into  FRANZLISP 
and  rehosted  onto  the  VAX. 

3.5  SYSTEM  DEVELOPMENT  TOOLS 

As  the  need  for  modifications  and  extensions 
to  ACRONYM  became  clear,  so  did  the  need  for 
specialized  software  tools  to  aid  in  this  process. 
ACRONYM  already  possessed  several  tools.  These 
included  a  control  trace,  goal  tree  trace,  and 
(remarkably)  a  special  rule  set  which  helps 
determine  why  subgoals  not  satisfied  by  the  system 
failed.  However,  these  tools  were  personalized  to 
the  needs  and  preferences  of  the  original  system 
developer,  and  differed  from  our  broader  needs 
based  on  a  shallower  understanding  of  ACRONYM 
implementation  philosophy  and  history. 

In  keeping  with  our  primary  objective  of 
demonstrating  a  prototype  system  used  by 
programmers  (as  distinct  from  building  a  production 
system) it was decided to add only minimal
facilities. A rudimentary explanation capability
has been added in the form of the ability to
interrupt (break) system operation upon
encountering a designated rule or subgoal, and then
ask "How" or "Why" the system was attempting what
it was trying to do. System responses consist of
subgoal and rule names and will soon be capable of
including phrases describing the purpose of each
(rule-controlled) step. Traces can be switched on
or  off  by  "executing"  a  designated  rule  or  subgoal. 
The  ability  to  display  bindings  to  variables  and 
data  structures  is  also  planned.  These  facilities 
will  be  further  extended  as  specific  needs  arise. 

4.0  CONCLUSION 

At  this  point,  the  development  and 
demonstration effort is going smoothly, and we are
fairly  optimistic  about  achieving  our  objectives. 
We  expect  to  be  able  to  demonstrate  significant 
technical  contributions  in  three  areas:  shadow 
prediction,  model-directed  segment  extraction,  and 
tracking  of  temporally  related  observations  across 
a  sequence  of  images.  Good  progress  is  being  made 
in  all  three  areas. 

Joe Mitchell provided programming support by
writing  the  Nevatia-Babu  linefinder  code  and  later 
rehosting  it  to  the  VAX. 


REFERENCES 

[1]  T.O.  Binford,  "Visual  Perception  by  a 
Computer,"  IEEE  Conf.  on  Systems  and 
Controls.  Miami  (December  1971). 

[2]  R.A.  Brooks,  "Goal-Directed  Edge  Linking  and 
Ribbon  Finding,"  Proceedings:  ARPA  Image 
Understanding  Workshop,  (April  1979)  pp. 
72-76. 

[3]  R.A.  Brooks,  R. Greiner,  and  T.O.  Binford, 
"The  ACRONYM  Model-Based  Vision  System," 
Proceedings IJCAI-6, Tokyo (Aug. 1979), pp.
105-113.

[4]  R.A.  Brooks,  "Symbolic  Reasoning  Among  3-D 
Models  and  2-D  Images,"  Artificial 
Intelligence 17, (1981), pp. 285-348.

[5]  R.A. Brooks, "Symbolic Reasoning Among 3-D
Models  and  2-D  Images,"  Ph.D.  Dissertation, 
Stanford  University,  1981. 

[6]  R.A.  Brooks,  ACRONYM  Manual,  unpublished. 


[7]  M.A. Fischler, "The SRI Image Understanding
Program," Proceedings: ARPA Image
Understanding Workshop, (April 1980), pp.
71-88.

[8]  M.A.  Fischler  and  A.J.  Hanson,  "The  SRI 
Image Understanding Program," Proceedings:
ARPA  Image  Understanding  Workshop.  (April 
1981),  pp.  223-235. 

[9]  P.G.  Mulgaonkar,  L.G.  Shapiro,  and  R.M. 
Haralick,  "Recognizing  Three-Dimensional 
Objects  from  Single  Perspective  Views  Using 
Geometric  and  Relational  Reasoning," 
Proceedings: PRIP, Las Vegas (1982), pp.
479-489.

[10]  R.  Nevatia  and  T.O.  Binford,  "Description 
and  Recognition  of  Curved  Objects," 
Artificial Intelligence 8, (1977), pp.
77-98. 

[11]  R. Nevatia and K.R. Babu, "Linear Feature
Extraction  and  Description,"  Computer 
Graphics  and  Image  Processing  13.  (1980). 
pp.  257-269. 

[12]  R.C.  Schank,  Conceptual  Information 
Processing. New York: North-Holland
Publishing  Company,  1975. 


Appendix:  DESCRIPTION  OF  ACRONYM 

Acronym  can  be  viewed  as  two  major  components: 
a  basic  core  and  the  vision  system.  The  core 
consists of support sub-systems based on artificial
intelligence  techniques,  including  a  record 
package,  a  rule  system,  and  a  constraint  system. 
This  "tool  kit"  is  very  general  in  nature,  and 
quite  independent  from  any  particular  problem 
domain.  Built  on  top  of  this  core  is  the  actual 
vision  system.  This  consists  of 
application-independent  modules  for  modeling 
three-dimensional  objects  and  matching  these  models 
to  shapes  extracted  from  two-dimensional  images. 
Specific  applications,  depending  on  their 
difficulty,  require  various  levels  of 
application-dependent  knowledge. 

A.1  ACRONYM  CORE

A  major  element  of  the  core  of  ACRONYM  is  the 
record  package.  It  provides  a  complete  set  of 
routines  to  handle  the  creation,  maintenance,  and 
access  to  a  wide  variety  of  data  structures.  The 
user  can  specify  the  form  of  each  record  structure, 
representing  data  either  as  simple  list  elements, 
or  as  named  properties  on  an  atom's  property  list. 
An interface to ACRONYM's LISP top level allows
nodes  of  these  data  structures  to  be  displayed. 
They  are  formatted  as  named  slots  with  the  data 
shown  as  fillers  for  these  slots.  Slot  fillers  can 
also  be  other  nodes  of  the  structure,  thus 
permitting a complex graph structure to be formed.

The  rule  system,  another  major  core  element, 
provides the user with a means to write production
system rules. These rules are written to conform
to ACRONYM's rule format. This rule format
specifies  an  advertisement,  a  set  of 
pre-conditions,  and  an  executable  body.  Once  a 
rule  has  been  written,  it  is  translated  by  the  rule 
parser  into  LISP  source  code,  which  is  then 
compiled  and  loaded  into  the  system.  Also  included 
are  routines  to  initiate  rules,  trace  their 
execution,  and  help  the  user  debug  rule  sets. 

A  very  powerful  and  useful  element  of  ACRONYM 
is the constraint system, which is capable of
complex  algebraic  manipulations.  The  user  may,  for 
example,  define  some  variable  whose  value  is  known 
only  to  lie  within  certain  bounds.  One  element  of 
the  constraint  system  maintains  these  types  of 
constraints  in  a  normal  form,  using  the  record 
package  described  above.  As  the  user  (or  some  LISP 
function)  defines  these  types  of  constrained 
variables,  and  begins  to  use  them  in  algebraic 
equations,  the  constraint  system  will  invoke  other 
modules  which  perform  algebraic  simplification  and 
determine  bounds  on  values  of  the  expression.  As 
more  constraints  are  entered  or  determined,  the 
system tries to merge these constraints to
determine  tighter  constraints  either  on  variables 
or  expressions.  This  provides  a  powerful,  symbolic 
approach  to  algebraic  aspects  of  problem 
computations. 
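
The idea of bounding an expression from bounds on its variables can be illustrated with plain interval arithmetic. This is a simplified stand-in chosen for the example, not ACRONYM's symbolic constraint manipulation; the `Interval` class and its method names are assumptions.

```python
class Interval:
    """A quantity known only to lie within [lo, hi]."""

    def __init__(self, lo, hi):
        if lo > hi:
            raise ValueError("inconsistent constraint: lo > hi")
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # The extreme values of a product occur at the corner points.
        corners = [self.lo * other.lo, self.lo * other.hi,
                   self.hi * other.lo, self.hi * other.hi]
        return Interval(min(corners), max(corners))

    def merge(self, other):
        """Combine two constraints on the same quantity (intersection)."""
        return Interval(max(self.lo, other.lo), min(self.hi, other.hi))

    def as_tuple(self):
        return (self.lo, self.hi)
```

For instance, a length in [4, 6] and a width in [2, 3] give an area in [8, 18]; merging that with an independently known bound of [10, 30] tightens the area to [10, 18].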

A.2  ACRONYM  VISION  SYSTEM

The  vision  system  is  built  on  top  of  the 
ACRONYM  core  "tool  kit".  It  is  a  powerful  image 
understanding  tool  capable  of  matching 
three-dimensional  object  models  to  two-dimensional 
shapes  extracted  from  sensed  image  data.  Figure  9 
is  a  block  diagram  of  this  system,  viewed  as  four 
distinct  modules  which  operate  in  both  a  top-down 
and  a  bottom-up  fashion.  From  a  top-down 
perspective,  objects  of  interest  are  represented  by 
a  modeling  system.  From  these  models,  the 
prediction  system  symbolically  predicts 
two-dimensional  shapes  which  may  be  expected  to 
occur  in  imagery.  From  the  bottom  end,  the 
description  module  extracts  two-dimensional  shapes 
by  edge  detection  and  shape  determination.  These 
predicted  and  observed  shapes  are  then  matched  by 
the  interpretation  module. 

Three-dimensional  objects  are  represented  in 
ACRONYM  by  models  built  from  three-dimensional 
generalized  cones.  Each  cone  is  described  by  an 
axial spine along which some surface is swept. The
surface  may  be  varied  as  it  is  swept  along  the 
spine.  Some  examples  of  generalized  cones  are 
shown  in  Fig.  10. 


The  modeling  system  makes  full  use  of  the 
constraint  system,  allowing  the  creation  of  classes 
of  models,  with  generic  classes  having  very  loose 
constraints,  and  more  detailed  dimensions 
describing  specific  sub-classes.  An  example  might 
be a generic class of land-based vehicles, with
sub-classes of wheeled vehicles and tracked
vehicles, and more detailed sub-classes of trucks,
jeeps, etc.

ACRONYM's prediction module is used to
transform  the  three-dimensional  cones  of  the  object 
models  into  two-dimensional  shape  surfaces,  such  as 
would be seen in actual imagery. In order to do
the  necessary  two-dimensional  shape  predictions, 
the  predictor  requires  some  knowledge  of  the 
viewpoint.  In  ACRONYM  this  is  represented  as  a 
camera model, which contains information about the
position,  orientation,  and  focal  ratio  of  the 
sensing  camera. 

The  prediction  process  is  rule-based,  with 
specific  rule  sets  having  knowledge  of  the 
geometric  reasoning  processes  required  for  shape 
prediction.  Certain  rules  predict  the  classes  of 
shapes  that  may  be  seen  from  the  camera  viewpoint, 
while  others  predict  the  spatial  relationships 
between  these  shapes. 

The  prediction  process  also  makes  use  of  the 
constraint  system,  allowing  predictions  based  on 
incomplete information about the camera model.
Predictions are also constrained by uncertainties
in  the  model  specifications.  The  result  of 
predictions  based  on  loosely  constrained  camera 
parameters  or  generic  model  classes  would  be 
general  descriptions  of  the  possible  shapes,  with 
constrained  ranges  of  dimensional  values  and 
spatial  relationships. 

The  description  module  in  ACRONYM  attempts  to 
extract  edge-based  shape  features  from  imagery. 
Edge  detection  is  performed  by  the  Nevatia-Babu 
linefinder developed at USC. From this edge
segment  information,  the  ribbon  finding  process 
attempts  to  form  closed,  convex  shapes.  These  are 
then described as "standard" shape descriptions
corresponding  to  the  types  of  shapes  generated  by 
the  prediction  process. 

Finally,  the  interpretation  module  matches 
shapes  found  by  the  description  process  to  the 
predicted  shapes,  and  reports  these  matches. 

Figure 3. Top-Level Control Rule Hierarchy.

(a) A PLANAR SURFACE IS SHADOWED
WHEN THE ANGLE BETWEEN I AND
THE SURFACE NORMAL IS < 90°

(b) SHADOW BOUNDARY IS WHERE
I AND THE SURFACE NORMAL
ARE PERPENDICULAR

I IS THE VECTOR REPRESENTING
THE ILLUMINATION DIRECTION

Figure 4.

COLINEAR SPINES (CONNECTED ARCS)


NON-COLINEAR  SPINES 
(ANGLE  ARCS) 


•  SPATIAL  RELATIONS  BETWEEN  SPINES  OF  SUBPARTS  ARE  PREDICTED 

•  SPATIAL  RELATIONS  BETWEEN  THE  SPINES  OF  OBJECTS  AND  THEIR 
SHADOWS  WILL  ALSO  BE  PREDICTED 


Figure  5. 

• HEURISTICS NEED KNOWLEDGE OF SHAPE, WIDTH, LENGTH, ETC.
TO "FILL IN" MISSING LINES

• SPATIAL INDEXING CAN EFFICIENTLY LOCATE PARTIAL SHAPES

TYPICAL SEGMENTS


Figure  7.  Model-Driven  Shape  Extraction. 

Figure 9. Acronym Vision System.


ABSTRACT

A  novel  Residue  Arithmetic  Digital  Image 
Understanding  System  (RADIUS)  is  described.  This 
represents  an  application  of  VLSI  technology  to 
real-time  image  understanding.  Extensive  use  is 
made  of  high  speed  programmable  look-up  tables  to 
perform  a  wide  variety  of  functions  on  a  sliding 
5x5  kernel.  Residue  arithmetic  techniques  minimize 
look-up  hardware  and  provide  effectively  concurrent 
arithmetic.  An  interface  to  the  PDP  UNIBUS  has 
been  developed.  Preliminary  results  of  a  study  to 
determine  functional  requirements  of  a 
complementary logic processor are presented.


INTRODUCTION 

The  advent  of  Large-Scale  Integrated  circuit 
technology  has  radically  changed  the  philosophy  of 
computer  design  and  engineering.  Until  recently, 
because  of  the  extraordinarily  high  cost  of 
computer  development,  machines  were  typically  aimed 
at  the  widest  possible  user  base.  This  has  led  to 
the  almost  exclusive  development  of  the  Von  Neumann 
type of architecture, which is in use in most
general purpose machines today. Typically these
computers  have  the  advantage  of  general 
applicability,  but  for  many  applications  they  can 
be  very  inefficient.  In  certain  areas  the 
applications'  importance  and  the  processing  needs 
are  such  that  special  purpose  machines  optimized 
for  a  single  application  would  have  significant 
advantages.  Image  Understanding  and  Artificial 
Intelligence  are  such  areas  where  the  processing 
and  data  access  requirements  are  sufficiently 
stringent  that  novel  architectures  should  be 
considered.  The  advent  of  Very  Large  Scale 
Integrated  Circuitry  (VLSI)  and  high-level  design 
technologies  have  enabled  such  efforts  to  be 
seriously  considered. 

In  addition  to  providing  the  potential  for 
making  more  dense  and  higher  speed  structures,  VLSI 
essentially  provides  an  opportunity  to  re-think  the 
conventional  restrictions  on  computer 
architectures. An important part of this new
philosophy is the realization that data access and
interconnection  are  now  the  significant  bottleneck 
in  high  speed  processors,  rather  than  the  speed  of 
the  ALU.  This  has  led  to  much  work  aimed  at 
regularizing  the  flow  of  data  through  the  processor 
and  the  elimination  of  long  and  circuitous  data 
paths.  Examples  of  this  approach  are  contained  in 
the Systolic and Wavefront approaches of H. T. Kung
and S. Y. Kung. Further, the need for a regular
and uniform topology on the semiconductor material
itself has been recognized by experts such as C. A.
Mead. The benefits of regular arrays of devices
with, wherever possible, only local neighbor
interconnect and communication, are higher device
speed, and hence increased throughput, together
with greater ease and feasibility of VLSI design.

Our  work  on  the  IU  has  recognized  this,  and  we 
have  wherever  possible  structured  the  architecture 
in  terms  of  local,  modular  blocks  of  devices.  Since 
the  extreme  of  this  approach  is  the  memory  array 
with large numbers of identical cells regularly
interconnected,  we  have  structured  our  processor 
around  a  look-up  table.  However,  in  order  to 
provide  the  necessary  dynamic  range  and  flexibility 
for  all  low-level  image  understanding  operations, 
we  have  decided  to  implement  the  arithmetic  in 
'residue'  notation.  Using  this  approach  we  have 
been  able  to  provide  a  fully  programmable  processor 
which operates over a 5 x 5 kernel and performs all
the  commonly  used  arithmetic  functions  in  image 
understanding  using  a  memory  block  of  only  32  x  5 
bits.  Significant  additional  advantages  of  this 
approach include extendability, selectable dynamic
range,  fault  tolerance,  and  the  ability  for 
self-checking.  We  have  developed  a  single  10,000 
transistor  LSI  chip  for  this  processor,  as  shown  in 
Figure  1.  We  currently  use  20  of  these  in  the  full 
processor.  (An  eventual  VLSI  device  with  1  micron 
design  rules  and  80,000  transistors  would  perform 
the  full  operation.) 

To  complement  this  processor  some  form  of 
logic  processor  is  required,  which  typically 
requires  less  throughput  and  considerably  reduced 
dynamic  range.  With  these  two  machines  (the  RADIUS 
and  logic)  a  full  system  consisting  of  DEC  host, 
the  RADIUS,  and  LOGIC  can  be  developed.  These 
issues are discussed below.

RADIUS PROCESSOR DESCRIPTION

The Residue Number System

The residue number system was extensively
studied by Szabo and Tanaka. It is based on a
collection of N integers $m_1, m_2, \ldots, m_N$,
each of which is called a modulus. The moduli must
be pairwise relatively prime, i.e. no two of them
share a common factor. The residue of x mod m is
the least non-negative integer remainder of the
division of x by m, where x is the number to be
converted and m is the modulus. Since the dynamic
range of the full processor, M, is given by the
product of the moduli, any integer in the range
$0 \le x \le M-1$ can be uniquely represented.
These residues are somewhat analogous to the digits
in conventional numbers.

The  strength  of  the  residue  number  system  lies 
in  the  way  in  which  arithmetic  operations  are 
performed.  Essentially  arithmetic  operations  are 
performed  in  parallel  in  each  base  and  then 
combined to uniquely determine the final result.
An example of residue addition is given in Figure
2. Here we choose the moduli of 5, 7 and 8, giving
a dynamic range M of 280. Thus the range before
overflow is from 0 to 279. In the example, 19 is
converted to 4, 5, 3 in the residue system. The 4
is the residue of 19 mod 5, etc., as shown. Each
column is then independently added and the sum is
expressed as a residue of the associated modulus.
In the first column 4 + 2 = 6, which is equivalent
to 1 modulo 5. The other columns are processed in
a similar manner. The result 1, 1, 2 means that
the actual result, when divided by 5, gives a
remainder of 1, when divided by 7 a remainder of 1,
and when divided by 8 a remainder of 2. These
numbers can then be uniquely decoded to determine
the result, 106, within the range 0 to 279.
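
The addition example can be reproduced in a few lines. This sketch assumes the second operand is 87, which is inferred from the stated sum of 106 (Figure 2 is not reproduced here), and it decodes with the Chinese remainder theorem rather than with the hardware method described later. The function names are illustrative.

```python
from math import prod

MODULI = (5, 7, 8)  # moduli used in the example; dynamic range M = 280


def to_residues(x, moduli=MODULI):
    """Convert an integer to its residue representation."""
    return tuple(x % m for m in moduli)


def add_residues(a, b, moduli=MODULI):
    """Add two residue vectors column by column, each modulo its own base."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))


def from_residues(residues, moduli=MODULI):
    """Decode a residue vector with the Chinese remainder theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the multiplicative inverse of Mi modulo m.
        total += r * Mi * pow(Mi, -1, m)
    return total % M


a = to_residues(19)          # (4, 5, 3)
b = to_residues(87)          # (2, 3, 7)
s = add_residues(a, b)       # (1, 1, 2)
result = from_residues(s)    # 106
```

Each column is handled without reference to the others, which is the property that lets the hardware process all moduli in parallel.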

As  the  example  shows,  the  arithmetic 
operations  performed  on  each  modulus  are 
independent of each other, making them ideally
suited  to  parallel  computation.  Furthermore, 
look-up  tables  can  be  used  to  perform  the  operation 
on  each  modulus,  giving  very  high  speed  operation. 
By  using  a  sufficient  number  of  moduli  it  is  easy 
to  produce  an  adequate  dynamic  range,  yet  the 
look-up  tables  can  be  kept  small.  There  is  a 
certain amount of overhead involved in converting
from  binary  to  residue  and  residue  to  binary,  but 
the  high  speed  of  the  arithmetic  operations  more 
than  compensates  for  this. 

Implementation  of  Algorithms 

Conversion  from  a  binary  representation  to  a 
residue  representation  is  straightforward.  It 
requires  that  we  calculate 

$$(x \bmod m_1,\; x \bmod m_2,\; \ldots,\; x \bmod m_N)$$

where x is the value of the input data and $m_i$ is
the ith base. The simplest way to perform this
calculation for a general set of bases is to use
Read Only Memories (ROMs). By connecting the input
data to the address lines of the ROM and looking at
the ROM's data lines for the output, a look-up
function is performed. For our particular
processor, which will support an eight-bit input
dynamic range and bases which can be encoded in
five bits or less, the size of an encoding ROM is
256 x 5 bits.
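
The contents of such an encoding ROM are easy to generate. The text states only that there are four bases and that the largest is 32, so the set (32, 31, 29, 27) below is a hypothetical choice of pairwise relatively prime bases made for illustration.

```python
BASES = (32, 31, 29, 27)  # hypothetical base set; only "largest is 32" is given


def build_encoding_rom(base, input_bits=8):
    """Return the ROM contents: entry x holds x mod base.

    With an 8-bit input there are 256 entries, and every entry
    fits in 5 bits because base <= 32.
    """
    return [x % base for x in range(1 << input_bits)]


# One ROM per base; the input pixel addresses all of them at once.
ENCODING_ROMS = {base: build_encoding_rom(base) for base in BASES}


def encode_pixel(pixel):
    """Look up the residue vector for an 8-bit pixel value."""
    return tuple(ENCODING_ROMS[base][pixel] for base in BASES)
```

Encoding is then a single table read per base, with no division performed at run time.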

Once  the  data  have  been  converted  to  residue 
format,  we  can  perform  a  variety  of  calculations  by 
using  random  access  memory.  For  example,  to  do  a 
convolution  on  the  image  data  it  is  necessary  to 
multiply  each  of  the  25  elements  of  the  kernel  by 
the  assigned  value,  and  to  add  the  results.  The 
multiplications  are  done  by  using  a  programmable 
look-up  table  on  each  base  of  each  number,  100 
operations  in  all.  Since  the  largest  base  is  32, 
no  look-up  table  need  be  larger  than  32  x  5  bits, 
and  all  look-ups  are  done  in  parallel. 
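
A sketch of the table-driven convolution for one output pixel follows. It reuses the same hypothetical base set (32, 31, 29, 27), restricts itself to non-negative integer weights, and runs the look-ups sequentially, whereas the hardware performs them in parallel.

```python
from math import prod

BASES = (32, 31, 29, 27)        # hypothetical base set, largest base 32
DYNAMIC_RANGE = prod(BASES)     # product of the bases


def mul_table(weight, base):
    """Look-up table for multiplication by a constant modulo `base`.

    The table has `base` entries (at most 32), each below 32.
    """
    return [(weight * r) % base for r in range(base)]


def convolve_pixel(window, weights, bases=BASES):
    """Sum of products over a 25-element window, in residue form.

    `window` holds 25 pixel values and `weights` 25 non-negative
    integer weights. One table look-up is made per element per base
    (100 in all for four bases).
    """
    result = []
    for m in bases:
        tables = [mul_table(w, m) for w in weights]
        acc = 0
        for x, table in zip(window, tables):
            acc = (acc + table[x % m]) % m   # modular accumulate
        result.append(acc)
    return tuple(result)
```

As long as the true sum of products stays below the product of the bases, the residue vector returned here identifies it uniquely.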

Conversion  back  to  binary  format  is  based  on 
the  mixed  radix  method.  In  our  system,  we  exploit 
the  fact  that  we  require  an  output  dynamic  range  of 
only  eight  bits.  The  method  can  be  explained  by 
considering an iterative process where at every
iteration  the  smallest  base  is  eliminated  by 
dividing  the  data  value  by  that  base  value. 
Dividing  essentially  reduces  the  dynamic  range  of 
the  value  and  thus  eliminates  the  need  for  the 
extra  base.  Of  course,  since  we  are  limited  to  a 
strictly  integer  system,  we  must  make  sure  that  the 
value  is  evenly  divisible  by  the  smallest  base. 
This  can  be  done  by  rounding  up  or  down  so  that  the 
element  in  the  residue  vector  for  that  base  is 
zero.  An  architecture  that  performs  this  mixed 
radix  conversion  for  a  four  base  system  looks  like 
the  3-level  tree  shown  in  Figure  3.  At  the  bottom 
level  of  this  tree  structure  the  fourth  base  is 
eliminated.  At  the  next  highest  level  of  the  tree 
the  third  base  is  eliminated.  Finally  we  are  left 
with  two  base  values  which  can  be  decoded  with  a 
simple  look-up  table. 
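
The base-elimination idea can be illustrated with the standard mixed radix conversion. This is the textbook exact form, not the rounding and tree-structured variant used in the processor; the function names and the small moduli (5, 7, 8) are chosen only for the example.

```python
def mixed_radix_digits(residues, moduli):
    """Return digits a_1..a_n with
    x = a_1 + a_2*m_1 + a_3*m_1*m_2 + ...

    At each step the current base is eliminated: its digit is
    subtracted from the remaining residues, which are then divided
    by that base (multiplied by its modular inverse).
    """
    r = list(residues)
    digits = []
    for i, mi in enumerate(moduli):
        a = r[i] % mi
        digits.append(a)
        for j in range(i + 1, len(moduli)):
            mj = moduli[j]
            r[j] = ((r[j] - a) * pow(mi, -1, mj)) % mj
    return digits


def from_digits(digits, moduli):
    """Evaluate mixed radix digits back into an ordinary integer."""
    value, weight = 0, 1
    for a, m in zip(digits, moduli):
        value += a * weight
        weight *= m
    return value


moduli = (5, 7, 8)
digits = mixed_radix_digits((1, 1, 2), moduli)   # residues of 106
value = from_digits(digits, moduli)
scaled = from_digits(digits[1:], moduli[1:])     # 106 // 5
```

Dropping the lowest digit is the same as dividing by the smallest base and discarding the remainder, which is how the dynamic range can be reduced toward an eight-bit output.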

Hardware  Configuration  (Data  flow) 

The  actual  computations  on  the  image  data  are 
performed by a custom LSI NMOS circuit. The
circuit  processes  a  5  x  1  kernel  and  is  capable  of 
performing  computations  of  the  form 

$$y = \sum_{i=1}^{5} f_i(x_i)$$

where y is the output value, $x_i$ are the five
elements in the kernel, and $f_i$ are polynomial
functions of a single variable.

A  schematic  of  the  circuit  is  shown  in  Figure 
4.  The  word  size  for  this  is  five  bits,  which 
limits  the  bases  used  to  a  value  of  32  or  less. 
The circuit is designed to accept a 5-bit input
word (at a 10 MHz rate) which is clocked into a
5-element shift register. The contents of each
register element are then shifted to the next
register. The data in each of the shift register
elements is used to address a look-up table, which
is a 32 x 5 Random Access Memory (RAM). This
look-up table performs operations requiring only a
single operand such as multiplication by a constant
or a squaring operation. The outputs of the 5 RAMs
are then summed modularly to produce a 5-bit
output, the base of the modular addition being
programmable  by  external  control  of  the  circuit. 
Since  the  look-up  tables  are  composed  of  RAMs  the 
circuit  can  be  programmed  for  many  different 
computations,  such  as  different  weights  for  a 
convolution  or  different  powers  of  a  number  for  a 
statistical  calculation. 
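
A behavioural sketch of one such circuit is given below: a 5-element shift register, five 32 x 5 look-up tables, and a modular adder with a programmable base. The class name `KernelChip` and its interface are assumptions made for illustration and say nothing about the actual chip's timing or pin-out.

```python
class KernelChip:
    """Behavioural model of the 5 x 1 kernel circuit (illustrative only)."""

    def __init__(self, base, tables):
        if not 2 <= base <= 32:
            raise ValueError("base must fit in five bits (2..32)")
        if len(tables) != 5 or any(len(t) != 32 for t in tables):
            raise ValueError("need five tables of 32 entries each")
        self.base = base
        self.tables = [list(t) for t in tables]
        self.shift = [0] * 5          # shift register, newest word first

    def clock(self, word):
        """Clock in one 5-bit word and return the 5-bit modular sum."""
        self.shift = [word & 0x1F] + self.shift[:-1]
        total = sum(t[x] for t, x in zip(self.tables, self.shift))
        return total % self.base


def weight_tables(weights, base):
    """Program the RAMs for multiplication by constant weights."""
    return [[(w * r) % base for r in range(32)] for w in weights]
```

Loading different tables reprograms the same circuit, for example swapping convolution weights for a squaring table.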

To  utilize  this  circuit  (with  a  5  x  1  kernel) 
in  a  5  x  5  local  area  processor,  multiple  copies  of 
the  circuit  are  used  as  well  as  additional  logic  to 
combine  the  outputs  of  the  individual  circuits. 
For  each  base,  5  of  these  circuits  are  used,  one 
for  each  line  of  the  kernel.  In  addition,  four 
1024 x 5-bit ROMs are used to sum the outputs of
the  five  circuits.  Figure  5  shows  a  block  diagram 
of  the  processor  and  Figure  6  is  a  picture  of  the 
actual  processor. 

UNIBUS  INTERFACE 

A  most  important  feature  of  the  RADIUS 
processor,  in  addition  to  its  potential  for  large 
scale  integration  for  military  systems,  etc.,  is 
that  it  can  be  attached  to  a  general  purpose 
machine  in  a  research  environment  for  the 
development  of  high  level  image  understanding 
techniques.  In  fact,  the  high  speed  design  (data 
are  accepted  at  100  nanosecond  intervals)  will  also 
allow  real  time  stand  alone  operation,  but  other 
considerations become important when the RADIUS
processor  is  used  as  a  peripheral  device  with  a 
general  purpose  computer. 

In  order  to  get  optimal  use  of  the  processor 
we  need  the  fastest  type  of  transfer  available 
between  the  processor  and  the  main  memory,  where 
the  data  to  be  processed  will  reside.  The  Direct 
Memory Access (DMA) type of transfer is the fastest
type  of  transfer  that  a  general  purpose  computer 
can  support,  since  it  does  not  require  processor 
intervention.  For  DEC  UNIBUS  applications,  the 
data  rate  is  approximately  1  Megabyte/sec. 

We have designed an interface around a DEC
DR11B UNIBUS parallel interface card. This provides
both the DMA transfer capability and several
control lines to allow multiple transfer modes. To
operate the interface, the starting address and the
amount of data to be transferred are specified, and
a start code is given. The DR11B then transfers the
specified data without further intervention. The
control lines specify whether look-up table data or
image data is involved in the transfer. For
look-up table data, the interface simply passes the
data to the 16 bus lines in the processor. For
image data transfers, the interface is more
complex. Since the DR11B transfers 16 bits at a
time, each word contains two eight-bit pixels.
Following the word transfer, the interface splits
it into two bytes which are then fed consecutively
to the RADIUS processor. After transferring a
complete line of pixels, the DR11B is switched into
read mode and one line of processed image data
(stored in the digital line delay) is read back
into the computer.
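The word-splitting step can be sketched in a few lines. This is only an illustration of the idea, not the interface logic itself: the function names are hypothetical, and the byte order (low-order byte first) is an assumption, since the text does not say which pixel of each pair reaches the processor first.

```python
def split_words(words):
    """Split 16-bit words into consecutive 8-bit pixels.

    Assumption: the low-order byte is the first pixel of each pair.
    """
    pixels = []
    for w in words:
        if not 0 <= w <= 0xFFFF:
            raise ValueError("word out of 16-bit range")
        pixels.append(w & 0xFF)          # first pixel (assumed low byte)
        pixels.append((w >> 8) & 0xFF)   # second pixel (high byte)
    return pixels


def pack_pixels(pixels):
    """Inverse step: pack pairs of 8-bit pixels into 16-bit words."""
    if len(pixels) % 2:
        raise ValueError("a line must hold an even number of pixels")
    return [pixels[i] | (pixels[i + 1] << 8) for i in range(0, len(pixels), 2)]


line = [10, 200, 35, 7]
words = pack_pixels(line)
print(split_words(words) == line)
```

Packing and splitting are exact inverses, so a line of pixels survives the round trip through 16-bit words unchanged.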

Several diagnostic capabilities are built into
the interface. The standard DR11B card can be
tested alone, and the custom built part of the
interface can also be tested once operation of the
DR11B has been verified. Also, the contents of the
look-up tables in the RADIUS processor can be read
back into the computer to verify their accuracy.

The  entire  interface  is  built  on  wire  wrapped 
cards  which  plug  directly  into  the  backplane  of  an 
11/34.  This  connects  to  the  RADIUS  chassis  with 
ribbon  cable  so  that  the  entire  assembly  can  be 
simply plugged into a UNIBUS. Furthermore, the
software is being configured so that the RADIUS
processor can be called as a subroutine.
Consequently, any DEC computer user is able to do
rapid 5x5 image processing operations with great
ease. Since the UNIBUS operates at 1 Mbyte/sec, a
256 x 256 image can be processed in 132 msec.
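The quoted processing time is close to what a simple bus-limited estimate gives. The sketch below assumes one byte per pixel, 10^6 bytes per second, and that every pixel crosses the bus twice (once in, once back); these are assumptions about the accounting, not figures stated in the text.

```python
def transfer_time_ms(rows, cols, bytes_per_sec=1_000_000, passes=2):
    """Bus-limited time to move an image across the bus `passes` times."""
    total_bytes = rows * cols * passes   # one byte per pixel (assumed)
    return 1000.0 * total_bytes / bytes_per_sec


print(transfer_time_ms(256, 256))   # about 131 msec under these assumptions
```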

RESULTS 

The  RADIUS  processor  performs  a  variety  of 
local  neighborhood  arithmetic  operations  using  a 
5x5  kernel.  A  typical  function  that  can  be 
programmed  is  a  convolution  or  a  sum  of  products 
over  the  5x5  window.  To  program  the  processor  to 
perform the convolution, the 25 weighting factors
are  selected,  and  a  multiplication  look-up  table  is 
generated  for  each  weighting  factor  for  each  of  the 
four  bases.  The  only  two  constraints  in  the 
selection  of  the  weighting  factors  are  that  they  be 
integers  and  that  the  combination  of  the  dynamic 
range  of  the  weights  and  input  data  not  exceed  the 
internal dynamic range of the processor,
approximately  21  .  These  look-up  tables  are 
transferred  into  the  processor,  which  is  then 
capable  of  performing  the  programmed  convolution  at 
input  pixel  rates  up  to  10  MHz. 
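The table-driven convolution can be illustrated in software. The sketch below is a minimal model under stated assumptions: the four moduli are hypothetical (any pairwise-coprime set would do), the function names are invented for illustration, and the final decoding uses the Chinese remainder theorem rather than the mixed radix decoder of the hardware.

```python
from math import prod

MODULI = (31, 29, 27, 25)   # hypothetical pairwise-coprime bases
M = prod(MODULI)            # dynamic range of this model


def make_tables(weights):
    """One multiplication table per weight per base: table[r] = (w * r) mod m."""
    return [[[(w * r) % m for r in range(m)] for m in MODULI] for w in weights]


def residue_convolve(window, tables):
    """Sum of products over one window, computed independently in each base."""
    sums = [0] * len(MODULI)
    for pixel, per_base in zip(window, tables):
        for b, m in enumerate(MODULI):
            sums[b] = (sums[b] + per_base[b][pixel % m]) % m
    return sums


def decode(residues):
    """Chinese-remainder reconstruction, mapped to a signed range."""
    x = 0
    for r, m in zip(residues, MODULI):
        q = M // m
        x += r * q * pow(q, -1, m)   # modular inverse (Python 3.8+)
    x %= M
    return x - M if x >= M // 2 else x


weights = [(-1) ** i * (i % 4) for i in range(25)]   # integer weights
window = [(7 * i + 3) % 256 for i in range(25)]      # 8-bit pixels
direct = sum(w * p for w, p in zip(weights, window))
print(decode(residue_convolve(window, make_tables(weights))) == direct)
```

The result matches the direct sum only while the true value stays inside the signed range of the moduli, which mirrors the dynamic range constraint on the weights and input data.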

A convolution is just one of the possible
operations that can be performed. Another useful
function for image processing that RADIUS is
capable  of  performing  is  the  sum  of  squares 
operation.  This  is  used  frequently  in  the 
calculation  of  variances  and  other  statistical 
moments  which  are  commonly  used  for  texture 
discrimination.  To  program  the  processor  for  sum 
of  squares  one  calculates  a  squaring  table  for  each 
base  and  transfers  these  tables  into  the  processor. 
In  general,  the  processor  can  evaluate  an  integer 
coefficient  polynomial  function  for  each  element  in 
the  5x5  local  area  and  then  sum  the  results  for 
each  of  the  25  functions.  Table  1  lists  some  of 
the functions that RADIUS can perform.
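As an illustration of how a sum of squares feeds a statistical moment, the local variance over a window follows from the sum and the sum of squares. The sketch below assumes 8-bit pixels and uses ordinary integer arithmetic in place of the residue hardware; the names are hypothetical.

```python
SQUARE = [v * v for v in range(256)]   # squaring look-up table for 8-bit pixels


def local_variance(window):
    """Population variance of a window from its sum and sum of squares."""
    n = len(window)
    s = sum(window)
    ss = sum(SQUARE[p] for p in window)   # table look-up replaces multiplication
    return (n * ss - s * s) / (n * n)


print(local_variance([0, 2]))   # 1.0
```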

A typical example of the processing capability
of the architecture is shown in Figure 7. Here we
show an input scene (a) which has been decomposed
into six edge images (b through g) at angles nπ/6
(for n = 0 to 5). Each of these resulting images
provides a calculation of the edge magnitude for a
particular direction, and in this sense a full edge
evaluation has therefore been completed. This is
most important for any subsequent line finding for
object identification or image registration, etc.
A commonly used and effective technique based on
the Nevatia-Babu concept is to perform these 6
separate edge calculations for each pixel in the
original image as shown, and then to select, again
for each pixel, the largest edge magnitude. This
therefore provides the dominant edge vector for
each  location  in  the  image,  and  can  be  readily  used 
in  the  subsequent  thinning  and  connectivity 
algorithms  for  full  line  tracing.  This  vector  edge 
composite  as  calculated  by  the  processor  is  shown 
in  Fig.  7(h).  The  form  of  this  particular  example 
is  a  linear  sum  of  products  or  convolution,  but  it 
should  be  emphasized  that  other  non-linear 
operations  such  as  variance,  standard  deviations, 
etc.,  can  similarly  be  performed  using  RADIUS. 
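The selection of the dominant edge at each pixel can be sketched as follows. This is an illustrative model, not the processor's implementation: it assumes the six directional responses are already available as equally sized 2-D lists, takes the absolute value of each response as its magnitude, and resolves ties in favor of the first direction.

```python
def dominant_edges(responses):
    """Pick, per pixel, the largest directional response and its angle.

    responses[n][r][c] is the response of the mask oriented at 30*n degrees.
    """
    rows, cols = len(responses[0]), len(responses[0][0])
    magnitude = [[0] * cols for _ in range(rows)]
    direction = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = max(range(len(responses)),
                       key=lambda n: abs(responses[n][r][c]))
            magnitude[r][c] = abs(responses[best][r][c])
            direction[r][c] = 30 * best   # degrees
    return magnitude, direction


mag, ang = dominant_edges([[[1]], [[-5]], [[3]], [[0]], [[2]], [[4]]])
print(mag, ang)   # [[5]] [[30]]
```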

LOGIC  PROCESSOR 

As  has  been  said  many  times  before,  the  main 
computational  bottlenecks  in  image  analysis  are 
operations  which  are  performed  in  the  pixel  domain. 
There  have  been  many  processors  which  have  been 
developed  which  can  do  very  fast  operations  in  the 
pixel  domain,  but  they  are  typically  constrained  in 
the  type  of  operations  they  can  perform.  Since  the 
development  of  pixel  level  operations  is  still  a 
very  active  field  it  seems  only  reasonable  that  new 
processors  will  be  required  to  implement  these  new 
operations. 

PROCESSOR  REQUIREMENTS 

Many  operations  that  are  performed  on  every 
pixel  have  a  fixed  data  flow  and  require  no  logic 
or decision capability to be performed. For these
types of operations a processor such as RADIUS is
ideally suited. However, there are many operations
which  are  performed  at  the  pixel  level  which  do 
require  some  sort  of  logic  or  decision  capability. 
There  are  processors  which  are  suited  to  these 
logic  operations  but  they  have  not  been  designed  to 
handle  the  sophisticated  operations  that  are  being 
developed and refined in the image understanding
community today. These operations include edge
thinning, edge linking, shrink and swell
algorithms, border following, and many more. We
have  examined  some  of  these  operations  to  determine 
what  primitive  operations  will  need  to  be  supported 
in  a  new  generation  logic  image  processor. 

Edge  Thinning 

The  edge  thinning  algorithm  we  chose  as  a 
representative  example  is  the  Nevatia-Babu  thinning 
that  is  incorporated  in  their  line  finder  system. 
The basic algorithm is as follows:


1. Form a 3x3 neighborhood
around the edge point of
interest.

2. Get the direction of that
edge.

3. Extract the two neighboring
edge points which are
perpendicular to the
direction of the center
edge.

4. Perform tests (comparisons)
on the directions of the
two neighbors and compare
their magnitudes to that of
the center pixel.

5. Replace the center pixel
with 0 if the tests are
false, otherwise leave
alone.

We  find  several  types  of  operations  in  this 
thinning  algorithm.  First  of  course  is  the 
neighborhood  formation,  which  is  certainly  expected 
but  bears  mentioning.  Second  is  the  use  of  an 
additional  input,  edge  direction,  to  control  or 
select the operation. Third are the tests which are
performed on both the edge direction and edge
magnitude for both the center pixel and the two
selected neighboring pixels. Finally, there is the
actual function determination and pixel replacement.
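A software sketch of this kind of thinning pass is given below. It is an illustration under several assumptions that the text does not fix: directions are in degrees and run along the edge, the two perpendicular neighbors are chosen by rounding the normal to the nearest 45 degrees, a neighbor passes the direction test if it lies within 30 degrees of the center, and the center must be strictly larger in magnitude than both neighbors. Border pixels are left untouched.

```python
# Offsets (row, col) of the neighbor lying along a normal of 0, 45, 90, 135 degrees.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def angle_diff(a, b):
    """Smallest difference between two directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def thin(mag, ang):
    """Zero every interior edge point that fails the neighbor tests."""
    rows, cols = len(mag), len(mag[0])
    out = [row[:] for row in mag]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if mag[r][c] == 0:
                continue
            # Quantize the edge normal to one of the four neighbor axes.
            normal = (round((ang[r][c] + 90) / 45) * 45) % 180
            dr, dc = OFFSETS[normal]
            for nr, nc in ((r + dr, c + dc), (r - dr, c - dc)):
                same_dir = angle_diff(ang[nr][nc], ang[r][c]) <= 30
                weaker = mag[nr][nc] < mag[r][c]
                if not (same_dir and weaker):
                    out[r][c] = 0   # tests false: delete the center
                    break
    return out


mag = [[0, 1, 3, 1, 0]] * 3
ang = [[90] * 5] * 3
print(thin(mag, ang)[1])   # [0, 0, 3, 0, 0]
```

The example is a vertical ridge three pixels wide; only the strongest column survives in the interior row.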

Edge  Linking 

The  algorithm  we  have  chosen  to  be 
representative  of  this  type  of  operation  is  the 
linking  portion  of  the  Nevatia-Babu  line  finder, 
which  is  performed  after  the  edge  thinning.  This 
linking operation has been designed with two
phases. The first phase calculates a predecessor
and successor for each edge point. The second
operation  then  constructs  the  lists  of  connected 
edges.  The  first  phase  of  the  linking  is  well 
suited  to  a  raster  oriented  pixel  processor,  while 
the  second  phase  requires  a  random  access 
capability  to  the  data,  so  we  only  considered  the 
operations  in  the  first  phase. 

This  linking  operation  is  more  complicated 
than  the  thinning  operation  in  that  comparisons 
must  be  made  on  both  the  edge  directions  and 
magnitudes.  Successor  candidates  are  generated  by 
comparing  the  direction  of  the  eight  neighboring 
pixels  to  the  direction  of  the  center  pixel.  A 
neighbor  is  a  possible  candidate  if  its  direction 
is within thirty degrees of the center pixel's
direction.  To  select  among  several  possible 
successors  their  individual  directions  must  be 
compared  as  well  as  their  magnitudes. 
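The direction test for successor candidates can be sketched as below. Only the thirty degree rule comes from the text; the function names are hypothetical, and the tie-breaking rule (smallest direction difference first, then largest magnitude) is an assumption made for illustration.

```python
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1)]


def angle_diff(a, b):
    """Smallest difference between two directions, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def successor_candidates(ang, mag, r, c, tol=30):
    """Neighbors with nonzero magnitude whose direction is within tol degrees."""
    found = []
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        inside = 0 <= nr < len(ang) and 0 <= nc < len(ang[0])
        if inside and mag[nr][nc] > 0 and angle_diff(ang[nr][nc], ang[r][c]) <= tol:
            found.append((nr, nc))
    return found


def pick_successor(ang, mag, r, c):
    """Choose one candidate (assumed rule: closest direction, then strongest)."""
    cands = successor_candidates(ang, mag, r, c)
    if not cands:
        return None
    return min(cands, key=lambda p: (angle_diff(ang[p[0]][p[1]], ang[r][c]),
                                     -mag[p[0]][p[1]]))


ang = [[0, 0, 0], [0, 90, 60], [0, 120, 200]]
mag = [[1, 1, 1], [1, 1, 2], [1, 5, 1]]
print(successor_candidates(ang, mag, 1, 1), pick_successor(ang, mag, 1, 1))
```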

Coarse  to  Fine  Operation 

The  combination  of  shrink  and  swell  functions 
can  be  used  to  look  at  the  image  at  different 
resolutions,  i.e.  go  from  fine  to  coarse.  This  is 
useful for deleting small components from images,
calculating "skeletons" of components, determining
sparsity of components, etc. The simple operations
of shrink and swell are as follows:


Swell: Change 0's to 1's if they have
any 1's as neighbors.

Shrink: Change 1's to 0's if they have
any 0's as neighbors.

The  major  functions  performed  are  the  generation  of 
the neighborhood, a test for 1's or 0's in the
neighborhood, and a replacement operation which
depends  on  the  outcome  of  the  test.  This  operation 
can be enhanced by using a slightly more
complicated test on the neighborhood. By
determining whether a point p is simple, i.e., its
neighborhood has exactly one point adjacent to p,
the shrink and swell operations can be made to
preserve component connectedness in the image, and
can be used to count connected components in the
image.
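The basic shrink and swell operations can be written directly from their definitions. The sketch below assumes a binary image stored as a list of lists, uses the 8-neighborhood, and ignores neighbors that fall outside the image; none of these choices is specified in the text.

```python
def _neighbors(img, r, c):
    """Yield the values of the 8-neighbors of (r, c) that lie inside the image."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < len(img) and 0 <= c + dc < len(img[0]):
                yield img[r + dr][c + dc]


def swell(img):
    """Change 0's to 1's if they have any 1's as neighbors."""
    return [[1 if v == 1 or any(n == 1 for n in _neighbors(img, r, c)) else 0
             for c, v in enumerate(row)] for r, row in enumerate(img)]


def shrink(img):
    """Change 1's to 0's if they have any 0's as neighbors."""
    return [[0 if v == 0 or any(n == 0 for n in _neighbors(img, r, c)) else 1
             for c, v in enumerate(row)] for r, row in enumerate(img)]


speck = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(swell(shrink(speck)))   # the isolated point is deleted
```

A shrink followed by a swell removes components too small to survive the shrink, while a 3x3 block of 1's is restored exactly.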

Border  Following 

Rosenfeld and Kak describe a border following
algorithm  which  is  an  iterative  labeling  scheme, 
but  which  uses  random  access  of  the  pixels. 
Presumably  similar  methods  can  be  constructed  that 
would  have  comparable  performance  but  could  use 
raster  access  to  the  pixels.  The  algorithm  works 
by  creating  a  seed  to  the  border  and  then 
propagating  the  border  by  looking  at  a  subset  of 
the  neighbors  and  either  stopping  on  some  criterion 
or  setting  the  value  of  the  center  pixel  to  a 
border  value  and  selecting  the  next  center  pixel. 
In  a  raster  algorithm  additional  logic  would  be 
required  to  determine  if  the  pixel  was  a  border, 
but  this  logic  should  be  comparable  to  that  of 
selecting  the  next  border  point.  The  types  of 
tests  performed  by  the  algorithm  include  comparing 
the center pixel to a constant (3), comparing
neighbors to two constants (2 and 9), and selecting
the  first  neighbor  (in  a  clockwise  rotation)  that 
is  nonzero. 

Zero  Crossings 

This  operation  is  well  known  for  its  use  in 
the  generation  of  the  primal  sketch,  whereby  a 
function's first derivative is maximized by finding
zero crossings in the second derivative. To
perform this operation one needs to look in the
eight directions from the center pixel and
determine if there is a change of sign. This
implies that the processor is able to form the 3x3
neighborhood,  represent  positive  and  negative 
values,  and  compare  signs. 
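A minimal sketch of the sign test is shown below. It assumes the second-derivative image is given as a list of lists of signed numbers and marks an interior pixel when any of its eight neighbors has the opposite sign; treating zero-valued pixels as having no sign is an assumption.

```python
def zero_crossings(img):
    """Mark interior pixels whose sign differs from that of any 8-neighbor."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = img[r][c]
            out[r][c] = int(any(
                center * img[r + dr][c + dc] < 0   # opposite signs
                for dr in (-1, 0, 1) for dc in (-1, 0, 1) if dr or dc))
    return out


print(zero_crossings([[1, 1, 1], [1, 2, -3], [1, 1, 1]]))
```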

We  have  examined  five  different  functions 
which  require  different  kinds  of  logical  operations 
to  be  performed  on  image  pixels.  Table  2 
summarizes  the  set  of  primitive  operations  that 
would  be  required  of  a  processor  which  could 
perform  all  of  the  functions  we  have  examined. 

SUMMARY 

We  have  described  a  novel  processor  (RADIUS) 
which  uses  residue  arithmetic  to  achieve  the  high 
throughputs  required  for  real-time  image 
understanding.  This  processor  is  currently 
implemented  using  a  custom  LSI  circuit  but  is 
ideally  suited  to  a  VLSI  implementation  because  of 
its memory intensive architecture. In addition we
have  developed  an  interface  which  allows  RADIUS  to 
be  attached  to  a  general  purpose  machine  and  be 
used  to  speed  up  processing  times  of  complex  IU 
software  systems.  A  natural  extension  to  RADIUS, 
which  does  the  computationally  intensive 
operations,  is  a  processor  capable  of  performing 
logic  operations  on  a  grey  level  image.  We 
presented  preliminary  results  of  a  study  to 
determine  the  functional  requirements  of  such  a 
processor. This study will be extended to include
a wider variety of systems as well as better
determining the needs of the IU community as a
whole.

Fig. 1. Photograph of Residue LSI circuit

Example of Addition Using Residue
Arithmetic

Fig. 3. Mixed Radix Residue Decoder

Fig. 5. Block Diagram of RADIUS Processor

Fig. 6. Photograph of RADIUS Processor

Schematic of Residue LSI Circuit

(H) COMPOSITE


POINT OPERATIONS
    POLYNOMIAL FUNCTIONS
    CONTRAST ENHANCEMENT

1-DIMENSIONAL OPERATIONS
    INTEGER COEFFICIENT TRANSFORMS
    POLYNOMIAL FUNCTIONS

2-DIMENSIONAL OPERATIONS
    EDGE ENHANCEMENT
    STATISTICAL DIFFERENCING
    LOW PASS/HIGH PASS FILTERING
    SHAPE MOMENT CALCULATIONS
    STATISTICAL MOMENT CALCULATIONS
    INTEGER COEFFICIENT TRANSFORMS
    TEXTURE ANALYSIS


Table 1. Functional Capabilities of RADIUS

Fig. 7. Example of RADIUS applied to Edge
Detection

PRIMITIVE FUNCTION             ALGORITHMS WHICH USE IT

Form neighborhood              All

Compare neighborhood           Edge thinning,
values (including center)      Border following
to a constant or multiple
constants

Compare neighborhood           Edge thinning,
values to each other           Edge linking,
or to a range determined       Zero crossings
by other neighborhood
values

Determination of adjacency     Edge linking,
(4 or 8) or simple             Border following,
connectedness                  Shrink and swell

Use neighborhood from          Edge thinning,
multiple pictures, i.e.,       Edge linking
magnitude and direction


Table 2. Functional Requirements of Logic Processor


A SURVEY AND EVALUATION OF FLIR TARGET
DETECTION/SEGMENTATION ALGORITHMS

ABSTRACT

A study was conducted at Westinghouse and the University of
Maryland to survey and evaluate FLIR target detection/segmentation
algorithms. The study was conducted in software on
a data base of 50 FLIR images supplied by the three branches of
the armed services. The study concluded that three techniques
(double-window filters, spoke filters, and border followers)
perform fairly well and are implementable in real-time
hardware.

INTRODUCTION 

This report describes results of a study conducted both at
Westinghouse and the University of Maryland under the
DARPA Image Understanding Program. One objective of the
study is to survey and evaluate FLIR target detection/segmentation
algorithms. Candidate algorithms were chosen from
those developed at Westinghouse, the University of Maryland
and other academic, governmental, and industrial organizations.
Algorithms from the initial list were evaluated over a
common data base if and only if: they passed certain preliminary
tests, performed well in previous studies, or were claimed
to perform exceptionally well by their authors.

Each algorithm was tested by its author’s organization on the
organization’s own computer facility. The study had one rule:
“No algorithmic parameters could be changed by human intervention
during the evidential run through the data base.*”
Refinements could be made to the software before this final pass
through the data base.

DATA  BASE  COMPILATION 

A data base of 50, 128 x 128 pixel FLIR images was compiled.
Several images from the data base are shown in Figure 1.
The sources of the images were four larger data bases prepared
by the three branches of the armed services. Twenty of the 50
images were constructed from 64 x 64 pixel target images
which were repeated four times by reflecting them about horizontal
and vertical axes (e.g., see Figure 1b). These were used to
test the algorithms’ sensitivity to the orientation of targets.
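The construction of a reflected image can be sketched as follows. The layout (original tile in the upper left, mirrored copies to the right and below) is an assumption; the text only states that each tile is repeated four times by reflection about horizontal and vertical axes.

```python
def reflect_mosaic(tile):
    """Build a 2N x 2N image from an N x N tile and its three reflections."""
    top = [row + row[::-1] for row in tile]   # tile beside its left-right mirror
    return top + top[::-1]                    # stacked on the up-down mirror


print(reflect_mosaic([[1, 2], [3, 4]]))
```

Applied to a 64 x 64 target image, this yields a 128 x 128 image containing the target in four orientations.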

The  data  base  was  compiled  with  a  goal  of  diversity.  Often 
results  are  reported  in  the  literature  in  which  an  algorithm  is 
tested  on  a  very  small  collection  of  similar  images.  This  allows 
an algorithm to be finely tuned to the ensemble statistics. In
compiling a diverse data base, it was hoped to simulate realistic
conditions  in  which  little  a  priori  information  is  available. 

EVALUATION  APPROACH 

The data base was carefully hand segmented to obtain the
centroid $(C_i, C_j)$ and vertical $(R_i)$ and horizontal $(R_j)$ radii of
each of the targets (Figure 2). This hand segmentation was a
cooperative effort of several persons who had knowledge of the
data base contents. All algorithms were required to use this
same format for output. That is, for each detected target, in
each image, the output is an estimate: $(\hat{C}_i, \hat{C}_j, \hat{R}_i, \hat{R}_j)$.

A target is said to be correctly detected iff:

$$\{\, |\hat{C}_i - C_i| \le \tfrac{1}{2} R_i + \tfrac{1}{2} \ \text{and}\ |\hat{C}_j - C_j| \le \tfrac{1}{2} R_j + \tfrac{1}{2} \,\}, \tag{1}$$

where all measures are in image picture elements (pixels). A
false alarm is any detection outside this prescribed region. Extra
detections inside the prescribed region are each scored as 1/3
false alarm.

With reference to Figure 3, let $B$ denote the rectilinearly oriented
box just enclosing a target as determined by hand segmentation.
Let $\hat{B}$ denote its estimate as output by a computer program.
Segmentation accuracy is estimated by:


or the common area divided by the total area of the figures,
averaged over detected targets. Note: $0 \le A \le 1$.
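The two scoring rules can be sketched in code. The detection test follows the centroid criterion of equation (1). For segmentation accuracy the sketch makes two assumptions: each box spans the centroid plus or minus the radius in each direction, and "total area of the figures" is read as the area of the union, which is the reading that lets the measure reach 1 for a perfect match. The function names are hypothetical.

```python
def is_detected(est, truth):
    """Equation (1): estimated centroid within half a radius (plus 1/2 pixel)."""
    ci_hat, cj_hat, _, _ = est
    ci, cj, ri, rj = truth
    return (abs(ci_hat - ci) <= 0.5 * ri + 0.5 and
            abs(cj_hat - cj) <= 0.5 * rj + 0.5)


def box_overlap(est, truth):
    """Common area divided by union area of two centroid/radius boxes."""
    def side_overlap(c1, r1, c2, r2):
        return max(0.0, min(c1 + r1, c2 + r2) - max(c1 - r1, c2 - r2))

    inter = (side_overlap(est[0], est[2], truth[0], truth[2]) *
             side_overlap(est[1], est[3], truth[1], truth[3]))
    area_est = 4.0 * est[2] * est[3]
    area_truth = 4.0 * truth[2] * truth[3]
    return inter / (area_est + area_truth - inter)


def segmentation_accuracy(pairs):
    """Average overlap over (estimate, truth) pairs of detected targets."""
    return sum(box_overlap(e, t) for e, t in pairs) / len(pairs)


truth = (11, 12, 4, 6)
print(is_detected((10, 10, 4, 4), truth), box_overlap(truth, truth))
```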

TESTS  PERFORMED  BY  COMPUTER  SIMULATION 

One goal of this project is to determine the suitability of various
hardware implementations of the candidate algorithms.
This entails an evaluation of both performance and implementability.
However, the tests discussed here took place only in
software and did not take advantage of the additional information
which would be available to a real-time target acquisition
system, such as range to center of field of view. This point will
be expanded upon below.

A  target  acquisition  system  typically  consists  of  a  number  of 
pipelined  stages.  The  first  stage  accepts  images  from  the  FLIR 
and  does  some  preprocessing.  This  Preprocessor  usually  takes 
the  form  of  one  or  more  local  filters  which  reduce  noise,  extract 
local  features,  and/or  increase  the  contrast  between  targets  and 
background. 


* Note: Image polarity was allowed as an input.

The second stage is the Detector, which locates blobs it suspects
of being targets. The centroids of these candidate targets
are then passed to a Segmentor and/or Feature Extractor. This
third stage either determines the borders of the detected blobs
or extracts features to describe them. This information is passed
to the fourth stage, called the Classifier. This final stage classifies
potential targets, prioritizes them, determines confidence
probabilities, and outputs this data along with target locations.

A  real-time  system  mounted  on  a  moving  vehicle  should  take 
advantage  of  the  following  types  of  additional  information 
which  are  usually  available. 

(1) Range - The range to the center of the field of view will
usually be known. This will permit calculation of expected target
size in pixels at that location. Size in pixels at other locations
in the field of view can be estimated from assumptions or
knowledge of terrain geometry.

(2)  History  -  Past  values  of  any  computed  parameters  can  be 
stored.  (For  example,  the  location  of  a  detected  target  on  past 
frames  can  be  used  to  predict  current  location.) 

(3) Control Loops - Feedback loops can be set up between
any  stages.  (For  example,  the  output  of  the  Classifier  can  be 
used  to  control  thresholds  in  the  Detector.) 

If a real-time system fits the model described, then separate
circuit boards might be used for each stage. Boards implementing
different algorithms could be interchanged, and the algorithms
tested on large quantities of data. If board pin-outs and
protocols are standardized, then even boards developed by different
companies could be interchanged and tested in a competitive
manner. No such standardization now exists. Furthermore,
in current real systems, the separation in functions
between the various stages is often not clearly defined. An algorithm
implemented on a board may combine parts of several
stages. Even the basic hardware organization may be different
from that discussed above, with some stages operating in parallel
or in a different order than indicated. There is always a
tradeoff in performance between the various stages, since it is
only the output of the final stage which really matters. For example,
it is acceptable for a detector to operate at a high false
alarm rate if a classifier will later cull out counterfeit targets.
This is an important qualification for a study such as this,
where individual algorithms are being tested, not entire systems.
We will deal in part with this concern by plotting detection
rate vs. false alarm rate, though our results are not as comprehensive
as may be desired.

CLUSTERING  IN  MEASURE  SPACE 

The standard method of segmenting an image is by gray level
thresholding. Here the classes correspond to gray level ranges,
e.g., "light = hot" and "dark = cool". Since these ranges are not
known in advance, they must be determined by examining the
gray level histogram and looking for peaks (one dimensional
clusters), and choosing thresholds (one dimensional decision
surfaces) that separate the peaks.

A number of investigators have suggested that multidimensional
feature space should also be useful for segmenting complex
gray scale images. A variety of features may be defined
over a neighborhood set about a pixel, e.g., mean, median, variance,
commonality, total variation. This approach could be
employed when a single feature, such as gray level, is not adequate
for segmentation because the given image contains a
number of textured regions whose gray level ranges overlap.

Initial work at Westinghouse has indicated that thresholding
by cluster detection in histograms is not adequate for separating
targets from background. Gray level target and background
clusters are often not separable, i.e., their probability densities
overlap. Likewise, the response of local operators tends to be
rather variable, not yielding well defined clusters. The basic
weakness of segmentation schemes which use only local feature
values is that they attempt to classify image parts without regard
to their relative positions in the image. It should not surprise
us that any approach which does not take spatial contiguity
fully into account fails much of the time.

ALGORITHMS  SELECTED 

The  segmentation  algorithms  investigated  in  our  study  are 
those  that  make  use  not  only  of  similarity  but  also  proximity. 
The  candidate  algorithms  have  been  classified  into  six  groups: 

(1)  Double  Window  Filters 

(2)  Spoke  Filters 

(3)  Border  Followers 

(4)  Relaxation  Algorithms 

(5)  Pyramid  Approaches 

(6)  Mode  Seekers 

Each  of  these  approaches  will  be  described  below.  At  least 
one  method  of  each  type  will  be  tested  on  the  assembled  data 
base. 


DOUBLE WINDOW FILTERS

A double window filter slides two non-overlapping windows
over an image. The windows are both commonly rectangles,
with one surrounding the other (Figure 4). The intensities
within the inner window are viewed as samples of a random
variable X and those of the outer window as samples of a random
variable Y. The objective is to determine if the inner window
surrounds a target, while the outer window contains background
clutter (Figure 5). Since little a priori information is
available about target and background statistics, some assumptions
are usually made before the problem is formulated and
solved. One way to pose the problem is in the language of statistical
hypothesis. Doing so usually involves choosing distributions
to model the behaviors of X and Y. Results, of course, are
valid only if the chosen distributions correctly describe the experimental
situation, namely that they provide the correct statistical
model. Although simple statistical hypotheses concern
only the parameters of assumed or known distributions, it is
also possible to define composite hypotheses which also concern
the fundamental forms of the distributions themselves. In
either case, to construct a criterion for testing a given statistical
hypothesis requires the formulation of an alternate hypothesis.
Symbolically, we will let $H_0$ stand for the null hypothesis and
$H_A$ stand for the alternate hypothesis.
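As a concrete instance of such a test, the sketch below compares the two windows with the textbook pooled two-sample t statistic, which assumes normal samples sharing a common unknown variance and has m + n - 2 degrees of freedom. It is offered as an illustration of the general approach, not as the exact statistic used in any of the cited systems; the critical value c must be supplied from a t table (for example, 2.145 for 14 degrees of freedom at a two-sided 0.05 level).

```python
from math import sqrt
from statistics import mean


def pooled_t(x, y):
    """Pooled two-sample t statistic for inner samples x and outer samples y."""
    m, n = len(x), len(y)
    xbar, ybar = mean(x), mean(y)
    ss = sum((v - xbar) ** 2 for v in x) + sum((v - ybar) ** 2 for v in y)
    pooled_var = ss / (m + n - 2)
    return (xbar - ybar) / sqrt(pooled_var * (1.0 / m + 1.0 / n))


def reject_null(x, y, c):
    """Two-sided test: reject equality of means when |t| >= c."""
    return abs(pooled_t(x, y)) >= c


inner = [10, 12, 11, 13, 12, 11]            # m = 6 samples
outer = [5, 6, 5, 7, 6, 5, 6, 5, 6, 7]      # n = 10 samples
print(reject_null(inner, outer, 2.145))     # True: the means differ
```

If targets are known to be hotter than the background, a one-sided version would compare t itself, rather than its absolute value, against a one-sided critical value.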


Figure 4. Window Geometry

Figure 5. Three cases of target/filter relationship: (a) target
completely within inner window, (b) target partially
within inner window, (c) target outside inner window.

A typical simple hypothesis is as follows:

$X_1, \ldots, X_m$ is an independent random sample of a normal
random variable X with mean $\mu_X$ and unknown variance
$\sigma_X^2$.

$Y_1, \ldots, Y_n$ is an independent random sample of a normal
random variable Y with mean $\mu_Y$ and unknown variance
$\sigma_Y^2$.

$H_0$: $\mu_X = \mu_Y$
$H_A$: $\mu_X \neq \mu_Y$

This test can be performed in terms of a quantity T which has
a t-distribution with m + n - 2 degrees of freedom [18]:

$k(\bar{X} - \bar{Y})$

where

$H_0$ is rejected at a significance level $\alpha$ if and only if $|t| \ge c$, where
t is the experimental value of T. If, for instance, n = 10, m = 6,
and $\alpha$ = 0.05, then c = 2.145.

If targets are always known to be hotter (i.e., of higher intensity
level) than the background, the test can be reformulated as:


This approach and more complex formulations are conventionally
used by Westinghouse in detection of targets in radar. A
number of companies are using it for target detection in FLIR
imagery.

Texas Instruments (TI) and Ford Aerospace (FA) have developed
filters of this type. TI [2] uses a metric of the form:


FA [1] divides the outer window into N subregions. They use a
metric of the form:

$$C = \sum_{k=1}^{N} \left( \bar{I}_X - \bar{I}_Y(k) \right), \tag{6}$$

where $\bar{I}_X$ is the mean of inner window pixels exceeding a specified
threshold and $\bar{I}_Y(k)$ is the mean of outer window pixels in
subregion k exceeding the threshold. Both TI and FA use range
for the control of window size. Neither referenced paper provides
a statistical model for the experimental design.

Wolfson of Westinghouse sees no justification in assuming
normality or any other statistical distribution. He suggests an
approach based on estimates of the probability density functions
of random variables X and Y (as obtained from their sample
values). A computer program was written to test this concept.
The upper three bits of sample intensity levels were used to
estimate probability density functions $f_X$ and $f_Y$. Since three
bits were used, these estimates in effect were quantized to eight
entries each; i.e., $\hat{f}_X = (\hat{f}_X(1), \ldots, \hat{f}_X(8))$, $\hat{f}_Y = (\hat{f}_Y(1), \ldots, \hat{f}_Y(8))$.
A target is detected if and only if the maximum intensity level in
the inner window exceeds that of the outer window and

$$\sum_{i=1}^{8} |\hat{f}_X(i) - \hat{f}_Y(i)|$$

exceeds a given threshold. Since range was not available for
these experiments, five filters were used in order of decreasing
size, with a detection occurring upon first exceeding the threshold.
Since several detections often occur over nearby pixels, detections
were spatially clustered and replaced by their cluster
centroid. The associated window widths and heights were averaged
to form an estimate of target extent. Results are shown in
Figures 6 and 7. The method works fairly well. It has difficulty
with targets near image borders, but this is not a serious problem
in an actual target acquisition system where the input imagery
is typically 875 x 875 pixels in size and targets cover a relatively
small area. Since this method uses raw gray levels, rather
than edges, it is bound to produce a fairly stable output from
frame to frame.
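The density-difference test for a single window pair can be sketched as follows. The sketch assumes 8-bit intensities, so that the upper three bits are obtained by a right shift of five, and normalizes each histogram by its sample count; the function names and the threshold value in the example are hypothetical.

```python
def density_estimate(samples, bits=8):
    """Eight-entry histogram of the upper three bits, normalized to sum to 1."""
    counts = [0] * 8
    for v in samples:
        counts[v >> (bits - 3)] += 1
    return [c / len(samples) for c in counts]


def density_difference_detect(inner, outer, threshold):
    """Detect iff the inner maximum exceeds the outer maximum and the
    summed absolute difference of the density estimates exceeds threshold."""
    if max(inner) <= max(outer):
        return False
    fx = density_estimate(inner)
    fy = density_estimate(outer)
    return sum(abs(a - b) for a, b in zip(fx, fy)) > threshold


hot_inner = [200] * 9 + [220]
cool_outer = [30] * 16
print(density_difference_detect(hot_inner, cool_outer, threshold=1.0))   # True
```

The summed difference ranges from 0 (identical estimates) to 2 (estimates with no bins in common), which bounds the useful range of the threshold.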

Figure 6. Detection Rate vs False Alarm Rate

Figure 7. Detection Rate vs Segmentation Accuracy


SPOKE  FILTER 

Minor  and  Skiansky’s  [4]  spoke  filter  is  a  generalization  of 
the  Hough  circle  detector  [5].  The  spoke  filter  uses  an  8-spoke 
digital  mask,  which  can  be  designed  to  detect  either  light  or 
dark  blobs,  as  illustrated  in  Figure  8.  The  arrows  in  the  figure 
define  a  template.  Each  detected  edge  falling  within  the  filter 
area  is  examined  to  determine  if  its  strength  exceeds  a  threshold 
and  matches  the  direction  of  the  corresponding  template  ele¬ 
ment.  An  8-bit  image  size  detection  buffer  is  used  to  store  edge 
matches.  Each  bit  corresponds  to  one  spoke  direction.  The  n-th 
bit  is  set  at  buffer  location  (x,y)  if  at  least  one  match  is  obtained 
along  the  n-th  spoke  when  the  filter  is  centered  at  (x,y).  Upon 
completion of the spoke filtering operation, a 3 x 3 OR filter is
convolved with the detection buffer. A detection is then considered
obtained at position (x,y) if at least N bits of the (x,y)
location of the buffer are set (typically N = 4).
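
The final stage of the spoke filter (OR filtering of the detection
buffer followed by a bit count) can be sketched as follows. This is
a minimal illustration, not the implementation of [4]: the buffer is
assumed to be a list of rows of integers whose low 8 bits are the
spoke flags, and the function names or_filter_3x3 and
spoke_detections are hypothetical.

```python
def or_filter_3x3(buf):
    """OR every cell of a bit-mask image with its 3 x 3 neighborhood."""
    rows, cols = len(buf), len(buf[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        acc |= buf[yy][xx]
            out[y][x] = acc
    return out


def spoke_detections(buf, n_required=4):
    """Return (x, y) positions where at least n_required spoke bits are set
    after the 3 x 3 OR filter (n_required = 4 is the typical N in the text)."""
    merged = or_filter_3x3(buf)
    return [(x, y)
            for y, row in enumerate(merged)
            for x, mask in enumerate(row)
            if bin(mask).count("1") >= n_required]
```

Because the OR filter spreads each flag over a 3 x 3 area, one blob
normally yields several adjacent detections, which is why a
clustering step is useful afterwards.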

Minor  and  Sklansky  perform  segmentation  after  detection  is 
completed.  The  segmentor  used  is  a  modified  version  of  the 
University  of  Maryland's  Superslice  [3,20,21]  algorithm.  Su¬ 
perslice  views  objects  as  being  distinct  from  their  surroundings 
by  the  presence  of  edges  at  their  boundary.  Superslice  starts  by 
choosing  a  threshold  to  segment  the  image  into  a  set  of  con¬ 
nected  components.  It  then  extracts  an  edge  map  from  the  im¬ 
age.  Finally,  it  measures  the  percent  of  each  component’s  border 
which  coincides  with  the  edge  map.  An  extracted  blob  is  one 
which  produces  high  edge-border  coincidence.  A  number  of 
thresholds  are  usually  tried  before  good  edge-border  coinci¬ 
dence  is  obtained.  Different  thresholds  may  be  required  for  ex¬ 
tracting  different  objects  in  the  same  scene.  Minor  and 
Sklansky’s version of Superslice also uses edge direction in the
measure  of  coincidence. 
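
The edge-border coincidence measure at the heart of Superslice can
be illustrated with a short sketch. It is a simplification made
under stated assumptions: the whole thresholded set is scored as one
region (the connected component labeling step is omitted), the edge
map is a precomputed boolean image, and edge direction is ignored.
The names border_pixels, coincidence and best_threshold are
hypothetical.

```python
def border_pixels(mask):
    """Region pixels with at least one 4-neighbor outside the region."""
    rows, cols = len(mask), len(mask[0])
    border = set()
    for y in range(rows):
        for x in range(cols):
            if not mask[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                yy, xx = y + dy, x + dx
                outside = not (0 <= yy < rows and 0 <= xx < cols)
                if outside or not mask[yy][xx]:
                    border.add((y, x))
                    break
    return border


def coincidence(image, edges, threshold):
    """Fraction of the thresholded region's border that lies on edge pixels."""
    mask = [[v > threshold for v in row] for row in image]
    border = border_pixels(mask)
    if not border:
        return 0.0
    hits = sum(1 for (y, x) in border if edges[y][x])
    return hits / len(border)


def best_threshold(image, edges, candidates):
    """Try several thresholds and keep the one with the highest coincidence."""
    return max(candidates, key=lambda t: coincidence(image, edges, t))
```

Scoring each connected component separately, as the text describes,
is what allows different thresholds to be selected for different
objects in the same scene.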

Figure 8a. Spoke Filter for Bright Objects on a Dark Background (From an Unpublished Version of [4])
Figure 8b. Spoke Filter for Dark Objects on a Bright Background

A group at Hughes [6] briefly describes a method which tests
each  pixel  in  a  region  of  interest  for  boundedness  by  edges  in 
eight  directions.  A  pixel  is  labeled  to  be  interior  to  some  bound¬ 
ary  if  it  is  bounded  by  edges  in  six  of  the  eight  directions.  The 
labeled  image  is  then  thresholded  to  extract  a  target  interior.  If 
Y  s  then  pixels  with  intensity  greater  than  threshold  C  are 
selected,  otherwise  pixels  with  intensity  less  than  C  are  chosen. 
Finally,  a  thinning  and  filling  operation  is  used  to  remove  small 
holes  and  smooth  the  segmented  region.  The  Hughes  report 
supplies  no  statistical  model  from  which  their  thresholding 
equation  is  derived. 

The  author  developed  another  algorithm  based  upon  the  ba¬ 
sic  spoke  filtering  philosophy.  The  algorithm  uses  sets  of  3x3 
pixel  templates  -  with  each  template  designed  to  match  a  section 
of  a  particular  object’s  silhouette.  The  operation  of  the  algo¬ 
rithm  will  be  described  by  example. 

First,  suppose  that  an  algorithm  is  to  be  designed  to  detect 
octagons  of  known  polarity.  A  4-bit  image  size  buffer  (initially 
zeroed)  will  be  used  to  record  detections.  The  algorithm  will 
start  by  scanning  across  each  of  the  rows  of  an  image,  applying 
a  3x3  vertical  edge  detector  at  each  point  visited.  When  a  left 
blob  edge  is  detected,  its  position  will  be  noted  (Figure  9a). 
The  scan  will  continue  until  a  right  blob  edge  is  detected  (Figure 
9b). At this time, a binary 0001 will be OR'd with the contents of the
buffer at a location corresponding to the midpoint between detected
edges (Figure 9c). This process is repeated upon the image
columns  and  in  the  two  diagonal  directions  using  the  markers 
shown  in  Figure  10a. 

At  the  completion  of  the  scans  in  the  four  directions,  we 
would expect to find a binary 1111 at buffer locations corresponding
to  centers  of  octagons  in  the  original  image.  Since  this  may  not 
always  occur,  due  to  the  discrete  nature  of  the  image  geometry, 
a  3  x  3  OR  filter  is  convolved  with  the  detection  buffer. 
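
The four-direction midpoint scan can be sketched as follows. This is
an illustrative simplification, not the tested implementation: the
input is assumed to be a binary image (1 = blob), so a 0-to-1
transition along a scan line stands in for the 3 x 3 edge detector,
runs that touch the image border are ignored, and the final 3 x 3 OR
filter is left out. All names are hypothetical.

```python
# Scan direction (dy, dx) and the buffer bit recorded for it.
DIRECTIONS = (((0, 1), 0b0001), ((1, 0), 0b0010),
              ((1, 1), 0b0100), ((1, -1), 0b1000))


def scan_lines(rows, cols, dy, dx):
    """Yield every scan line of the grid as a list of (y, x) positions."""
    for y in range(rows):
        for x in range(cols):
            py, px = y - dy, x - dx
            if 0 <= py < rows and 0 <= px < cols:
                continue  # (y, x) has a predecessor, so it does not start a line
            line, yy, xx = [], y, x
            while 0 <= yy < rows and 0 <= xx < cols:
                line.append((yy, xx))
                yy, xx = yy + dy, xx + dx
            yield line


def detect_centers(image):
    """Return a buffer whose bits mark run midpoints in each scan direction."""
    rows, cols = len(image), len(image[0])
    buf = [[0] * cols for _ in range(rows)]
    for direction, bit in DIRECTIONS:
        for line in scan_lines(rows, cols, *direction):
            start = None
            for i, (y, x) in enumerate(line):
                inside = image[y][x] == 1
                if inside and start is None:
                    start = i  # entering ("left") blob edge
                elif not inside and start is not None:
                    my, mx = line[(start + i - 1) // 2]  # midpoint of the run
                    buf[my][mx] |= bit
                    start = None
    return buf
```

A location holding binary 1111 was the midpoint of a run in all four
directions, which is the condition expected at the center of a
symmetric convex blob.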

Now  suppose  that  we  want  to  detect  military  vehicles.  Instead 
of  applying  edge  detectors  in  the  four  scan  directions,  feature 
detectors  will  be  applied  which  correspond  to  shapes  located 
around  the  sought  vehicles’  silhouettes.  Also,  the  diagonal  scan 
directions should not be exactly 45 degrees to image rows, but
rather  should  be  related  to  the  average  height  to  width  ratio  of 
targets  (Figure  10b). 

Figure 10a. Templates and Scanning Directions for Octagon Detector

Figure 10b. Templates and Scanning Directions for Military Vehicle Detector (A Succession of S Features Represents a Particular Template)

Upon  completion  of  the  four  scans,  a  spatial  clustering  algo¬ 
rithm  is  applied  to  detections,  with  cluster  points  replaced  by 
their  cluster  centroid. 

The  algorithm  performs  very  well  as  a  detector  as  seen  in 
Figure  6.  Detection  rate  was  67  percent  with  0.69  false  alarms 
per  target.  The  algorithm  was  also  evaluated  as  a  segmentor 
(Figure  7)  with  the  distances  between  a  centroid  and  the  hori¬ 
zontal  and  vertical  feature  detections  used  to  estimate  target 
radii.  Segmentation  accuracy  was  not  very  good,  with  targets 
usually  judged  smaller  than  they  actually  were. 

BORDER  FOLLOWING/EDGE  BASED  SEGMENTATION 

A  blob  outline  containing  no  breaks  is  a  closed  curve  separat¬ 
ing  the  blob  from  its  background.  This  curve  may  pass  through 
the  same  pixel  twice  if  the  blob  has  a  narrow  neck  near  or  at  this 
pixel.  A  number  of  border  following  algorithms  are  described 
in  the  image  processing  literature,  e.g.,  see  [7-10].  They  start  by 
locating  a  prominent  border  point,  and  continue  by  visiting  ad¬ 
jacent  border  points  (edge  elements)  in  sequence,  eventually  re¬ 
turning  to  the  starting  point.  Difficulties  arise  when  an  object's 
boundary is not distinct over its entire length.

Schenker and Cooper [7,9] describe an elaborate algorithm
which seeks the most probable blob boundary by conducting an
exhaustive search over all boundaries. The “true boundary” is
then viewed as the one which maximizes the joint likelihood of
the observed data and a hypothesized boundary. Since an exhaustive
search proves to be computationally prohibitive, suboptimal
search algorithms are also proposed.

Perkins [11], while with General Motors Research Lab, developed
an object detection/segmentation algorithm for use in
a  robot  vision  system.  The  algorithm  is  implemented  by  a  se¬ 
quence  of  simple  operations.  First,  an  edge  map  is  obtained. 
Next,  uniform  intensity  regions  are  extracted  by  expanding  ac¬ 
tive  edge  regions,  labeling  the  segmented  uniform  intensity  re¬ 
gions,  and  then  contracting  the  edge  regions.  The  region  with 
the  most  points  along  the  image  border  is  assumed  to  be  the 
background.  Finally,  foreground  region  boundary  points  are 
visited;  the  locations  of  points  which  are  roughly  equidistant 
along  a  boundary  are  stored  in  an  array.  The  performance  of 
Perkins’  algorithm  as  a  FLIR  target  detector  is  not  particularly 
good  (Figure  6).  The  detection  rate  over  the  test  data  base  was 
only 50 percent. The algorithm as implemented at GM does not
use  image  polarity  and,  as  a  consequence,  sometimes  misses 
nearby  targets,  while  indicating  a  detection  half-way  between 
them. The performance of the algorithm as a segmentor is excellent
(Figure 7).

A  simple  border  follower  was  programmed  and  tested  at 
Westinghouse.  A  detection  rate  of  61  percent  was  obtained  at  a 
very high false alarm rate (Figure 6).

RELAXATION  ALGORITHMS 

Rosenfeld et al. [12, 19, 23], Kirby [13], Peleg [14], and
others  have  viewed  image  segmentation  as  a  graph  labeling 
problem.  The  proposed  approximate,  iterative,  parallel  solu¬ 
tions  are  called  “relaxation  methods”  which  work  as  follows. 
First  an  initial  set  of  labels  is  assigned  to  each  image  node  based 
upon  local  image  properties.  The  labels  at  each  node  are  given 
weights between 0 and 1. This initial set of assignments may be
ambiguous  or  incorrect  in  spots  because  of  the  imperfect  nature 
of  the  local  image  measures  or  noise  in  the  image.  An  iterative 
process  using  information  obtained  from  neighbors  is  then  used 
to  improve  the  initial  decisions.  The  weights  at  each  node  are 
simultaneously  updated  at  each  iteration  (typically)  based  upon 
the  weights  at  node  neighbors  during  the  previous  iteration. 
This  has  the  effect  of  altering  the  probabilities  initially  assigned 
to  noise  points  to  make  them  more  consistent  with  their  sur¬ 
round.  The  process  is  stopped  when  the  labeling  of  all  nodes 
seems  to  reach  a  steady  state. 

To  apply  2-label  relaxation  to  segmentation  [23],  a  set  of 
“light”  and  “dark”  probabilities  is  assigned  to  image  nodes 
based  upon  their  gray  levels.  Probabilities  at  each  node  are  then 
iteratively  adjusted  based  upon  probabilities  at  neighboring 
nodes,  i.e.,  light  reinforces  light  and  dark  reinforces  dark. 
Eventually,  the  blob  pixels  should  become  uniformly  light,  and 
the  background  uniformly  dark,  so  that  segmentation  is  easy. 
The  blob  region  is  extracted  by  locating  connected  components. 
A  3-label  scheme  uses  labels  of  “object”,  “background”,  and 
“clutter”. 
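
A 2-label relaxation of this kind can be sketched as follows. The
update rule shown is one commonly used form (each label weight is
multiplied by one plus its neighborhood support and the weights are
then renormalized); the text does not specify the rule used in [23],
so it should be read as an assumption. The initial "light"
probability is taken as gray level divided by max_gray, clamped away
from 0 and 1, and the name relax_two_label is hypothetical.

```python
def relax_two_label(gray, iterations=5, max_gray=255, eps=0.05):
    """Return the "light" probability of each pixel after relaxation."""
    rows, cols = len(gray), len(gray[0])
    # Initial probabilities from gray level, kept strictly inside (0, 1).
    p = [[min(1 - eps, max(eps, v / max_gray)) for v in row] for row in gray]
    for _ in range(iterations):
        new_p = [row[:] for row in p]  # all nodes are updated simultaneously
        for y in range(rows):
            for x in range(cols):
                support = []
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if (dy or dx) and 0 <= yy < rows and 0 <= xx < cols:
                            # light neighbors support "light" (+), dark oppose (-)
                            support.append(2 * p[yy][xx] - 1)
                if not support:
                    continue
                q = sum(support) / len(support)
                light = p[y][x] * (1 + q)
                dark = (1 - p[y][x]) * (1 - q)
                if light + dark > 0:
                    new_p[y][x] = light / (light + dark)
        p = new_p
    return p
```

With this rule, an isolated dark noise point inside a light blob is
pulled toward "light" within a few iterations, after which the blob
can be extracted by thresholding the probabilities at 0.5 and
locating connected components.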

The  relaxation  algorithms  tested  performed  rather  poorly  as 
target  detectors  (Figure  6). 

PYRAMID  APPROACHES 

Let the size of an image be 2^n x 2^n pixels. Reduced resolution
versions can be linked with a pyramid data structure. The top
(level 0) of the pyramid will be a size 1 x 1 image. The original
2^n x 2^n image will be at the bottom of the pyramid. At level K
will be an image of size 2^K x 2^K. A number of different schemes
have been developed for detecting blobs with pyramids [24].

The  pyramid  linking  scheme  of  Burt,  Hong,  and  Rosenfeld 
[23]  works  as  follows.  A  father-son  relationship  is  defined  be¬ 
tween  nodes  of  adjacent  levels  of  the  pyramid.  Each  node  at 
level  K  is  the  father  of  a  2  x  2  array  of  “candidate  son”  nodes 
at level K+1. Likewise, each node at level K is the son of four
“candidate father” nodes at level K-1. At each iteration of the
pyramid linking algorithm, a node is linked to the “most similar”
of its four candidate fathers. A node is then assigned the
average  gray  level  of  those  sons  linked  to  it.  The  process  con¬ 
tinues  in  this  manner  until  a  steady  state  is  reached.  The  links 
then  define  trees,  and  the  leaves  of  “light”  trees  are  target  seg¬ 
ments.  The  particular  pyramid  linking  algorithm  tested  did  not 
perform  well  as  a  detector,  yielding  too  many  false  alarms* 
(Figure  6). 

MODE  SEEKERS 

There  is  a  class  of  techniques  in  which  a  pixel  is  iteratively 
replaced  by  the  average  of  a  selected  set  of  its  neighbors.  Naray¬ 
anan  and  Rosenfeld  [29]  describe  a  method  that  chooses  those 
neighbors  for  averaging  which  belong  to  the  same  histogram 
peak  as  the  given  pixel.  The  simplest  version  of  this  method 
chooses  those  neighbors  higher  up  on  the  image  histogram 
peak.  This  tends  to  move  pixel  gray  levels  toward  their  subpop¬ 
ulation  modes.  An  improved  version  chooses  a  neighbor  only  if 
there  is  no  significant  concavity  in  the  histogram  between  it  and 
the  given  pixel.  Ideally,  upon  termination,  there  will  be  a  two- 
spiked  histogram,  with  the  lighter  spike  formed  from  target 
pixels. Target regions are then obtained by extracting connected
components. 
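
One iteration of the simplest version can be sketched as follows.
The sketch rests on two assumptions that are not taken from [29]:
"higher up on the histogram peak" is read as "has a strictly larger
global histogram count", and "belongs to the same peak" is
approximated by requiring the neighbor's gray level to lie within
max_diff of the pixel's own. The name mode_seek_step and the
parameter max_diff are hypothetical.

```python
from collections import Counter


def mode_seek_step(image, max_diff=10):
    """Replace each pixel by the mean of its 8-neighbors that lie higher
    on the same histogram peak; pixels with no such neighbor are kept."""
    hist = Counter(v for row in image for v in row)
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(rows):
        for x in range(cols):
            own = image[y][x]
            chosen = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if (dy or dx) and 0 <= yy < rows and 0 <= xx < cols:
                        v = image[yy][xx]
                        if abs(v - own) <= max_diff and hist[v] > hist[own]:
                            chosen.append(v)
            if chosen:
                out[y][x] = round(sum(chosen) / len(chosen))
    return out
```

Repeating the step until the image stops changing drives gray levels
toward their subpopulation modes, so that the histogram collapses
toward a small number of spikes.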

A  local  version  of  the  algorithm  performs  processing  over 
relatively  small  image  windows  in  sequence,  while  the  global 
version  (named  Global  Superspike)  uses  the  histogram  for  the 
entire  image.  The  global  version  was  tested  since  the  images  in 
the  data  base  are  only  128  x  128  pixels  in  size.  Its  detection  rate 
was  88  percent  (Figure  6). 

EVALUATION 

The  mode  seeker  had  the  highest  detection  rate  of  any  of  the 
methods  tested.  Its  segmentation  accuracy  fell  in  the  65  percent 
to  75  percent  range,  as  did  most  of  the  other  methods. 

Spoke  filters  perform  very  well  as  detectors.  But  they  have  a 
low  segmentation  accuracy,  and  may  need  to  be  followed  by  a 
separate  segmentor  such  as  Superslice. 

Double  window  filters  work  well  as  detectors.  They  can  also 
be  used  for  segmentation  if  only  target  extents  are  required.  But 
preferably, they should be designed to seek out targets of particular
sizes if range is available as an input. If targets of several
sizes  and  orientations  are  sought,  then  a  different  filter  is  re¬ 
quired  for  each.  Each  filter  must  vary  in  size  down  the  image  to 
correct  for  perspective.  This  has  a  multiplicative  effect  on  hard¬ 
ware  size  and  cost. 

Simple  border  followers  must  be  run  at  a  high  false  alarm 
rate  to  yield  a  reasonable  detection  rate.  They  are  well  suited  for 
segmentation,  yielding  estimates  of  target  borders.  If  an  entire 
image  is  available  for  processing,  the  border  following  can  start 
at  the  most  prominent  border  point  and  proceed  from  there.  If 
an  image  is  to  be  processed  one  line  at  a  time  (on-the-fly),  then 
the topmost target point must be detected to initialize the follower.
This second approach may have difficulty with non-convex
targets but is easier to build into hardware.

Border  followers  and  double  window  filters  can  be  readily 
implemented  in  hardware.  However,  as  noted  above,  proper  im¬ 
plementation  of  the  double  window  filter  requires  considerable 
computation power. The spoke filter in its purest form, as proposed
by Minor and Sklansky, is not well suited for hardware
implementation. It is computationally expensive, requiring the
repeated access of a large array of edges (or pixels) as the filter
sweeps over the image. The algorithm can, however, be reformulated
in a number of different ways to meet the requirements
of particular computer architectures.

* Note: Another pyramid scheme was recently tested at the
University of Maryland on this data with much better

Relaxation  algorithms  are  best  implemented  in  a  cellular  ar¬ 
ray  architecture  having  one  processor  per  pixel.  No  one  has  ever 
built  a  target  acquisition  system  using  this  architecture;  to  do  so 
must  therefore  be  viewed  as  a  risky  venture.  The  same  holds 
true  for  pyramid  approaches. 

There  are  two  difficulties  with  implementing  the  mode  seeker. 
First, a FLIR image is typically 875 x 875, so a mode seeker should be
implemented  on  smaller  image  subregions,  possibly  horizontal 
strips  (range  zones).  Secondly,  the  required  post-processing  step 
of  extracting  connected  components  is  rather  difficult  to  imple¬ 
ment  in  real-time  hardware.  However,  the  image  smoothing  al¬ 
gorithm  [29],  upon  which  the  mode  seeker  is  based  would  be  an 
excellent  preprocessor  for  any  detection  algorithm.  An  analysis 
of  its  hardware  implementability  is  not  yet  complete. 

CONCLUSIONS 

The  following  conclusions  were  drawn  from  the  study. 

•  Mode  seekers,  double  window  filters,  and  spoke  filters  all 
show promise as target detectors. Border followers and
possibly  mode  seekers  (if  hardware  implementation  can  be 
worked  out)  are  fair  segmentors. 

•  No  existing  computer  architecture  appears  appropriate  for 
the  real-time  implementation  (on  one  or  two  circuit 
boards)  of  all  detection/segmentation  algorithms.  A  spe¬ 
cial  architecture  is  required  for  the  efficient  implementa¬ 
tion  of  each  particular  algorithm. 

It  was  originally  intended  to  test  an  artificial  intelligence  (AI) 
approach  to  improving  algorithm  performance.  This  appears  to 
be  much  more  difficult  than  initially  thought.  The  investigation 
has  not  revealed  any  real-time  target  acquisition  system  which 
uses AI concepts. It is concluded that, for the present, efforts are
better  directed  to  the  use  of  available  information,  such  as 
range  and  feedback  data.  At  a  later  time,  an  attempt  can  be 
made  to  incorporate  a  simple  knowledge  base  and  a  limited  rea¬ 
soning  ability.  Several  possible  examples  might  be: 

•  Ground  vehicles  are  likely  to  be  located  below  the  horizon. 

•  Targets  of  the  same  type  often  appear  in  groups,  sometimes 
in  moving  columns  on  or  near  roads. 

•  Gathered  intelligence  indicating  presence  of  different  tar¬ 
get  types  is  often  available. 

A  STRATEGY 

A  sound  strategy  is  to  precede  detection  by  preprocessing 
filters designed to suppress noise and “bring out” targets. The
Narayanan filter may be of use here. This should be followed by
an  initial  detection  algorithm  which  is  cheap  to  implement  in 
hardware,  but  which  is  not  necessarily  a  “star”  performer.  This 
algorithm  should  be  run  at  a  low  threshold  (i.e.,  high  false 
alarm  rate)  to  make  sure  that  most  targets  are  detected.  This 
should  be  followed  by  a  more  expensive  detection/segmenta¬ 
tion  algorithm  which  will  operate  only  on  the  initial  detections. 
Performance  should  be  improved  by  target  tracking,  frame-to- 
frame  integration  of  extracted  data,  and  decision  smoothing. 

ABSTRACT

Objects of arbitrary size are extracted
from images using a combination of three pyramid-based
representations of image features. A gray-scale
linked pyramid is used to smooth the image
into uniform regions. A "surroundedness" pyramid
is used to identify regions of interest, and a
linked edge pyramid is used to delimit the boundaries
of the compact objects.


1. Introduction

Many image processing tasks require the extraction
of objects from a background. Most notable
among these is target detection. In many cases
there is some a priori knowledge about the shapes
and sizes of the objects, which could aid in their
extraction. Unfortunately, it has not normally
been possible to extract objects that have the
right size and shape without extracting other,
unwanted objects as well. Removing the unwanted
objects then requires another stage of processing,
which can be very complicated if the desired objects
are embedded in background clutter.

This paper presents a pyramid-based method of
extracting compact objects that is able to apply
knowledge about the size and shape of an object
directly to the segmentation process, to avoid
extracting unwanted regions. The method provides
solutions to a group of problems, including object
detection, edge completion, and region filling. It
makes use of both gray-scale and edge information.
In addition, it computes a surroundedness measure
for each pixel, representing the degree to which
that pixel is locally surrounded by edges. All
three sets of information - gray level, edge magnitude
and direction, and surroundedness - are represented
in pyramid structures, and it is the
interaction between the different types of information
at each level of each pyramid that leads to
the final segmentation. The representations are,
themselves, built on one another. A gray-level
pyramid is used to construct an edge pyramid,
which is in turn used to construct a surroundedness
pyramid.

The gray-scale pyramid is used to segment the
original image into smooth regions that are not
necessarily connected (Section 2.1). It is common
for an object to belong entirely to one of these
regions, but the algorithm does not require this
to be the case. The edge pyramid (Section 2.2) is
used in two ways. The edges indicate parts of the
image that could be individual objects, enabling
the objects to be separated from the regions extracted
by the gray-level pyramid. The edges also
serve as the basis for constructing the pyramid of
surroundedness scores (Section 2.3).

The surroundedness scores are used to find
starting  points  for  a  combined  region-growing  and 
region-splitting  process.  The  growth  of  the  re¬ 
gion  is  controlled  by  the  gray-level  pyramid,  and 
the  region  is  pruned  by  the  edge  pyramid.  In  this 
sense,  the  method  is  analogous  to  the  "superslice" 
algorithm (Milgram, 1979) and to the relaxation
method  of  Danker  and  Rosenfeld  (1979).  One  of  the 
notable  features  of  the  method  is  that  the  region 
does  not  "leak"  through  holes  in  its  border.  This 
is  partly  because  of  the  pyramid's  tendency  to 
bridge  small  gaps  as  the  resolution  decreases  from 
level  to  level. 

A  previous  use  of  a  pyramid  process  for  ex¬ 
tracting  compact  objects  (Shneier,  1979)  made  use 
only  of  gray  values  and  a  compactness  measure.  For 
each  compact  region  that  was  discovered,  a  thres¬ 
hold  was  computed  and  applied  in  a  square  region 
of  the  original  image  to  extract  the  object.  The 
current  method  does  not  use  a  threshold  to  extract 
the  regions,  but  makes  use  of  edge  information  to 
determine  the  shapes  and  sizes  of  the  regions. 

The  process  of  constructing  the  pyramids  is 
described  in  Section  2,  and  the  succeeding  section 
describes  how  each  pyramid  is  used  to  arrive  at  the 
final  result.  Examples  are  given  of  applying  the 
system  to  a  set  of  images,  and  the  results  are  com¬ 
pared  with  those  obtained  in  a  recent  segmentation 
study (Hartley et al., 1981).

In  the  following  sections,  the  pixels  at  each 
level  of  the  various  pyramids  play  two  roles.  They 
are  points  in  an  image  at  some  level  of  a  pyramid, 
and  are  also  nodes  in  the  tree  structure  defined  by 
the  links  between  levels  in  the  pyramids.  Both 
names  will  be  used  interchangeably. 

2. Constructing the pyramids

2.1  Gray-scale  pyramid 

A  gray-scale  pyramid  is  a  sequence  of  square 
images,  each  a  lower-resolution  version  of  its  pre¬ 
decessor.  The  kind  of  pyramid  used  in  this  work  is 
the linked structure defined by Burt et al. (1980).
It  is  constructed  as  follows. 

Each  level  is  formed  by  summarizing  a  4  by  4 
neighborhood  in  the  preceding  level.  The  neighbor¬ 
hoods  are  overlapped  fifty  percent  vertically  and 
horizontally  so  that  each  pixel  has  four  "fathers" 
at  the  next  level,  and  sixteen  "sons"  at  the  pre¬ 
vious  level.  The  average  or  the  median  of  the  six¬ 
teen  sons  can  be  used  as  the  summarizing  value  for 
their  father.  In  the  implementation,  the  average 
value  was  used. 
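
The construction of the levels can be sketched as follows, as a
minimal illustration under stated assumptions: the image is a square
list of rows whose side is a power of two, father (i, j) summarizes
son rows 2i-1 to 2i+2 and son columns 2j-1 to 2j+2 (which gives the
fifty percent overlap), and neighborhoods that extend past the image
border are simply clipped. The border treatment and the function
names are assumptions, not details given in the text.

```python
def reduce_level(level):
    """Build the next (half-size) level from overlapped 4 by 4 averages."""
    n = len(level)
    half = n // 2
    out = []
    for i in range(half):
        row = []
        for j in range(half):
            sons = [level[y][x]
                    for y in range(2 * i - 1, 2 * i + 3)
                    for x in range(2 * j - 1, 2 * j + 3)
                    if 0 <= y < n and 0 <= x < n]  # clip at the border
            row.append(sum(sons) / len(sons))
        out.append(row)
    return out


def build_pyramid(image):
    """Return the levels from the original image up to the 2 by 2 level."""
    levels = [image]
    while len(levels[-1]) > 2:
        levels.append(reduce_level(levels[-1]))
    return levels
```

With this indexing, an interior pixel falls inside the neighborhoods
of exactly two fathers along each axis, hence four fathers in all.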

The  entire  pyramid  is  constructed  in  this  way, 
up  to  the  level  at  which  there  are  only  four  pixels. 
There  follows  an  iterated  linking  process  in  which 
each  node  is  linked  to  that  one  of  its  four  fathers 
whose  gray  value  is  most  similar  to  its  own.  A 
father  can  thus  have  up  to  sixteen  sons,  while  a 
son  can  have  only  one  father.  After  the  links  have 
been  established,  each  node  recomputes  its  gray 
value  based  only  on  the  values  of  the  sons  linked 
to  it.  This  process  is  iterated,  and  usually  sta¬ 
bilizes  after  a  few  iterations. 
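
One pass of the linking step between two adjacent levels can be
sketched as follows. The sketch assumes that father (i, j) covers
son rows 2i-1 to 2i+2 and son columns 2j-1 to 2j+2, that candidates
falling outside the pyramid are dropped, that ties go to the first
candidate examined, and that a father with no linked sons keeps its
old value. These choices and the function names are assumptions.

```python
def candidate_fathers(y, x, half):
    """Fathers (in a half x half level) whose 4 by 4 neighborhood holds (y, x)."""
    rows = [i for i in ((y - 1) // 2, (y - 1) // 2 + 1) if 0 <= i < half]
    cols = [j for j in ((x - 1) // 2, (x - 1) // 2 + 1) if 0 <= j < half]
    return [(i, j) for i in rows for j in cols]


def link_once(sons, fathers):
    """Link every son to its most similar father, then recompute each
    father as the average of the sons linked to it."""
    n, half = len(sons), len(fathers)
    links, linked_values = {}, {}
    for y in range(n):
        for x in range(n):
            best = min(candidate_fathers(y, x, half),
                       key=lambda f: abs(fathers[f[0]][f[1]] - sons[y][x]))
            links[(y, x)] = best
            linked_values.setdefault(best, []).append(sons[y][x])
    new_fathers = [[sum(linked_values[(i, j)]) / len(linked_values[(i, j)])
                    if (i, j) in linked_values else fathers[i][j]
                    for j in range(half)]
                   for i in range(half)]
    return links, new_fathers
```

Applying link_once to each pair of adjacent levels, from the bottom
up, and repeating until the links stop changing corresponds to the
iteration described above.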

At  this  stage,  each  pixel  at  the  bottom  level 
of  the  pyramid  (the  original  image)  is  linked 
through  some  sequence  of  ancestors  to  one  of  the 
four  pixels  in  the  topmost  (2  by  2)  level  of  the 
pyramid.  Each  topmost  node  thus  represents  some 
region  in  the  original  image,  which  can  be  extract¬ 
ed  by  following  links  down  the  pyramid.  If  the 
values  of  pixels  in  the  original  image  are  re¬ 
placed  by  the  corresponding  values  of  their 
ancestors, a segmentation of the image into at
most  four  regions  is  obtained.  It  is  not  necessary 
that  these  regions  be  connected. 

The  segmentation  defined  by  this  procedure  is 
not  necessarily  in  terms  of  objects  and  background. 
Indeed,  for  the  image  in  Figure  1,  the  chromosomes 
are  extracted  as  one  component,  while  the  back¬ 
ground  is  segmented  into  three  components  of  slight¬ 
ly  different  average  gray-value.  Often,  a  desired 
region  belongs  to  one  of  the  four  components,  but 
is  lost  among  the  other  parts  of  the  image  that 
link  to  the  same  component.  As  an  example,  notice 
that one of the small chromosomes in Figure 1b
disappears  entirely.  The  procedure  defined  in  this 
paper  is  largely  concerned  with  isolating  individual 
parts  of  the  four  components  into  separate  objects, 
although  it  is  also  able  to  merge  parts  from  dif¬ 
ferent  components  into  a  single  object.  The  pro¬ 
cess  relies  on  edge  and  surroundedness  information 
to  find  the  subcomponents  to  be  extracted. 

Figure  1  shows  an  image  and  the  results  of 
iterating the gray-level linking process. The resulting
preliminary segmentation forms the input to
the  rest  of  the  procedure. 


2.2  Edge  Pyramid 

The edge pyramid is constructed by first building
a gray-level pyramid and then applying an edge
operator  at  each  level  to  produce  an  edge  pyramid 
(Hong  et  al.,  1981).  The  gray-level  pyramid  used 
for  extracting  edges  was  based  on  non-overlapped  2 
by  2  blocks,  and  the  values  at  each  level  were  de¬ 
fined  as  the  medians  rather  than  the  means  of  the 
values  in  the  blocks  at  the  level  below.  This  re¬ 
duces  the  amount  of  blurring  and  distortion  of  the 
edges (Tanimoto, 1976).
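
The non-overlapped median reduction can be sketched as follows. The
function name median_reduce is hypothetical, and the median of four
values is taken here as the mean of the two middle values, which is
an assumption about a detail the text leaves open.

```python
from statistics import median


def median_reduce(level):
    """Halve the resolution using the median of each 2 by 2 block."""
    half = len(level) // 2
    return [[median([level[2 * i][2 * j], level[2 * i][2 * j + 1],
                     level[2 * i + 1][2 * j], level[2 * i + 1][2 * j + 1]])
             for j in range(half)]
            for i in range(half)]
```

For a block holding 0, 0, 0 and 100 the median is 0 whereas the mean
is 25, which illustrates why medians blur a step edge less than
means do.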

The  edge  operator  that  was  used  is  one  that 
scored  highest  in  the  edge  evaluation  tests  of 
Kitchen  and  Rosenfeld  (1981).  It  is  the  three-level 
template  operator  (Abdou  and  Pratt,  1979)  which  uses 
eight  direction  masks,  e.g. 

    -1  0  1          -1 -1  0
    -1  0  1   and    -1  0  1
    -1  0  1           0  1  1
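
The eight masks can be generated by rotating the outer ring of the
first mask in steps of one position, which reproduces the second
mask shown above after one step. The sketch below applies all eight
masks at an interior pixel and keeps the strongest response. Taking
the maximum response as the edge magnitude, and the index of the
winning mask as a direction code, is an assumption about how the
operator is used here, and the names are hypothetical.

```python
# Outer ring of a 3 x 3 mask, clockwise from the top-left corner.
RING = ((0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0))
BASE = (-1, 0, 1, 1, 1, 0, -1, -1)  # ring weights of the first mask


def make_masks():
    """Return the eight three-level masks (center weight is always 0)."""
    masks = []
    for k in range(8):
        m = [[0] * 3 for _ in range(3)]
        for idx, (r, c) in enumerate(RING):
            m[r][c] = BASE[(idx - k) % 8]
        masks.append(m)
    return masks


def edge_response(image, y, x, masks=None):
    """Return (magnitude, mask index) at interior pixel (y, x)."""
    masks = masks or make_masks()
    best_mag, best_k = 0, 0
    for k, m in enumerate(masks):
        r = sum(m[dy][dx] * image[y - 1 + dy][x - 1 + dx]
                for dy in range(3) for dx in range(3))
        if r > best_mag:
            best_mag, best_k = r, k
    return best_mag, best_k
```

For a vertical step from 0 to 10, the first mask gives the largest
response (30) and its diagonal neighbors give 20.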

The  edge  detection  is  followed  by  a  non-maximum 
suppression  stage.  A  3  by  3  window  is  placed 
around  each  edge  point.  The  direction  of  the  edge 
is  used  to  find  the  two  edge  points  to  use  for  non¬ 
maximum  suppression.  If  the  edge  point  has  a  mag¬ 
nitude  greater  than  both  points,  and  a  direction 
difference  of  less  than  45  degrees,  it  survives; 
otherwise,  it  is  deleted.  Figure  2a  shows  an  edge 
pyramid  constructed  from  the  chromosome  image. 

Edges,  too,  are  linked  together  between  levels. 
For  linking  purposes,  the  pyramid  is  assumed  to  map 
each  point  to  a  4  by  4  region  in  the  level  below. 
Once  again,  each  son  has  four  potential  fathers  and 
each  father  has  sixteen  sons.  Linking  proceeds 
bottom-up.  Each  son  compares  his  direction  with 
those  of  his  four  fathers,  and  chooses  the  father 
whose  direction  is  most  compatible.  If  the  differ¬ 
ence  in  directions  is  less  than  some  threshold 
(here  46  degrees),  the  son  is  linked  to  the  father. 
Otherwise,  the  son  becomes  the  root  of  a  tree.  Ties 
are  broken  by  choosing  the  first  father  that  satis¬ 
fies  the  criteria.  The  direction  of  a  son  is  up¬ 
dated  to  become  the  average  of  the  son's  direction 
and  the  father's  direction,  but  the  process  is  not 
iterated. 

2.3  Surroundedness  pyramid 

The  edges  at  each  level  of  the  edge  pyramid 
are  directed  in  such  a  way  that  the  brighter  side 
of  the  edge  is  to  its  right.  This  information  could 
be  used  by  itself  to  prune  the  gray-level  pyramid  by 
demanding  that  the  gray  levels  at  positions  cor¬ 
responding  to  opposite  sides  of  an  edge  obey  this 
constraint.  Such  a  process  would  not  necessarily 
lead  to  a  segmentation  into  compact  objects.  It  is 
first necessary to identify the edges that bound
compact  objects,  and  to  ignore  all  other  edges.  A 
procedure  for  finding  such  edges  was  described  in 
Hong et al. (1981).

In  the  current  system,  however,  the  aim  is  to 
extract  the  interiors  of  compact  regions.  The  pro¬ 
cess  is  applied  at  each  level  of  the  pyramid,  and 
compact objects of different sizes are identified
at different levels. There are two stages involved
in  finding  compact  regions  from  edge  information. 

First,  the  skeleton  of  a  region  is  found  by 
looking at 5 by 5 neighborhoods of each point.
There is no need to look further than two points
on either side of a pixel, because, if there are no
edges  within  this  distance,  the  object  will  be¬ 
come  more  compact  at  the  next  higher  level  of  the 
pyramid,  where  the  process  is  applied  as  well.  The 
aim  is  to  find  interior  points  of  a  region  that 
are  surrounded  by  edge  points  with  compatible 
directions. 

Let x be the central point in a 5 by 5 neighborhood
(Figure 3). The remaining points in the
neighborhood are divided into three classes. The
points marked A are the immediate neighbors of x,
while  those  marked  B  and  C  are  more  distant  from  x. 
The  numbers  associated  with  each  point  are  their 
chain  code  orientations  in  units  of  45  degrees. 
Finding  the  skeleton  proceeds  as  follows. 

If the edge magnitude of x is not zero, ignore
this point, because x is not interior.

If the magnitude is zero, check the neighbors
of x:

1.  For  each  type  A  neighbor  of  x  whose  edge 
magnitude  is  not  zero,  the  edge  direction 
of  A  is  allowed  to  differ  from  its  chain- 
code  direction  by  no  more  than  some 
threshold (here 23 degrees). For example,
the  edge  direction  of  the  point  immedi¬ 
ately  East  of  x  must  lie  between  -23 
degrees  and  +23  degrees,  while  the  edge 
direction  of  the  point  North-East  of  x 
must  lie  between  23  degrees  and  45  de¬ 
grees.  That  is,  the  edge  directions 
should be consistent with the edges of
a closed region. If this condition is
met,  the  score  for  the  particular  direc¬ 
tion  from  x  is  set  to  1.  The  score  is  a 
measure  of  how  central  the  point  is,  i.e., 
of  its  membership  in  the  skeleton  of  the 
region.  For  each  point,  there  are  eight 
slots  for  scores,  corresponding  to  eight 
directions.  A  perfect  border  around  x 
would  result  in  all  eight  slots  being  set 
to  1.  Note  that  more  than  one  point  in 
the  5  by  5  neighborhood  can  set  the  same 
slot  value. 

2.  If  the  magnitude  of  a  type  A  neighbor  of 
x  is  zero,  the  neighboring  type  B  point 
la  examined  as  above.  If  its  edge  direc¬ 
tion  is  compatible  with  its  grid  position, 
the  score  for  x  is  set  to  1. 

3.  For  all  type  C  points  whoae  edge  magnitude 
la  not  zero,  the  corresponding  direction 
alot  for  x  ia  set  to  1  if  the  direction  of 
the  point  is  within  23  degrees  of  the 
chain-code  position.  For  type  C  points, 
however,  the  chain-code  direction  is  cal¬ 
culated  at  45  *  chain-code  number  +  23, 
because  type  C  points  are  offset  an  extra 
23  degrees  from  x. 


Notice  that  all  the  type  A  and  type  C  points  con¬ 
tribute  to  the  score  for  x,  but  type  B  points  only 
contribute  if  the  neighboring  type  A  point  is  not 
an  edge  point.  This  is  because  closer  edges  are 
assumed  to  block  the  effects  of  edges  that  are  more 
distant,  and  hence  less  likely  to  belong  to  the  same 
object. This is particularly important at high
levels  of  the  pyramid  where  the  objects  are  very 
close  together. 
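The scoring rules above can be sketched in code. The following Python is an illustrative reading, not the implementation used in the study. The list-of-lists image layout, the names `slot_scores`, `A_OFFSETS`, and `C_OFFSETS`, and the convention that angles are in degrees measured counterclockwise from East (with rows increasing downward) are all assumptions. It further assumes that the stored edge direction of an ideal boundary point equals the chain-code direction of its position relative to x. The type C offsets follow the layout of Figure 3.

```python
# Hypothetical sketch of first-stage surroundedness scoring for one pixel.
# Offsets are (dx, dy) with dy pointing North; index k is the chain code.
A_OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
C_OFFSETS = [(2, 1), (1, 2), (-1, 2), (-2, 1), (-2, -1), (-1, -2), (1, -2), (2, -1)]
TOLERANCE = 23  # degrees, the threshold quoted in the text


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def magnitude_at(mag, r, c):
    """Edge magnitude at (r, c); points outside the image count as zero."""
    if 0 <= r < len(mag) and 0 <= c < len(mag[0]):
        return mag[r][c]
    return 0


def compatible(mag, ang, r, c, expected):
    """True if (r, c) is an edge point whose direction is near `expected`."""
    return magnitude_at(mag, r, c) != 0 and angle_diff(ang[r][c], expected) <= TOLERANCE


def slot_scores(mag, ang, r, c):
    """Return the eight slot values for pixel (r, c), or None for an edge point."""
    if mag[r][c] != 0:
        return None  # x is itself an edge point, so it is not interior
    slots = [0] * 8
    for k in range(8):
        dx, dy = A_OFFSETS[k]
        if magnitude_at(mag, r - dy, c + dx) != 0:
            # Rule 1: a type A edge point scores only if its direction fits.
            if compatible(mag, ang, r - dy, c + dx, 45 * k):
                slots[k] = 1
        elif compatible(mag, ang, r - 2 * dy, c + 2 * dx, 45 * k):
            # Rule 2: type B is consulted only when type A is not an edge.
            slots[k] = 1
        cx, cy = C_OFFSETS[k]
        if compatible(mag, ang, r - cy, c + cx, 45 * k + 23):
            # Rule 3: type C uses the direction 45 * k + 23.
            slots[k] = 1
    return slots
```

The surroundedness score of a point is then `sum(slot_scores(mag, ang, r, c))`. Because a type A edge point with an incompatible direction prevents the type B point behind it from being consulted, the blocking behavior described above falls out of the `if`/`elif` structure.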

When  the  scoring  process  has  been  applied  to 
each  5  by  5  neighborhood  at  each  level  in  the  pyra¬ 
mid,  the  second  stage  of  finding  compact  regions  is 
performed.  The  purpose  of  the  second  stage  is  to 
propagate  the  score  of  the  skeleton  out  to  the  bor¬ 
ders  of  the  region.  For  the  second  stage,  the 
score is computed as the sum of the slot values.
A threshold is applied to decide what score values
are considered to constitute valid skeleton points
(here a score of 5 out of a possible 8 was used).
For each such point the following procedure is performed.

1.  For  all  type  A  or  C  points  whose  edge  mag¬ 
nitudes  are  not  zero  and  whose  edge  direc¬ 
tions  are  compatible  (as  in  the  previous 
step), assign a new score which is the maximum
of the current score and the sum of
the  slot  values  for  x  (the  skeleton  point). 

2.  For  type  A  points  whose  edge  magnitude  is 
zero,  check  the  corresponding  type  B 
point.  If  its  magnitude  is  not  zero  and 
its  direction  is  compatible,  assign  a  new 
score  to  both  the  type  A  point  and  the 
type  B  point.  In  each  case  the  score  is 
the  maximum  of  the  score  for  x  and  the 
current  score  for  the  point. 
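The two propagation steps can be sketched as follows. This is a hedged illustration under the same assumed conventions as the scoring rules (angles in degrees counterclockwise from East, rows increasing downward, type C offsets as in Figure 3). The function name `propagate` and the choice to read scores from the input array while writing to a copy are assumptions made here so that a propagated score cannot itself trigger further propagation within the same pass.

```python
# Hypothetical sketch of the second stage: spread skeleton scores to borders.
A_OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
C_OFFSETS = [(2, 1), (1, 2), (-1, 2), (-2, 1), (-2, -1), (-1, -2), (1, -2), (2, -1)]
TOLERANCE = 23     # degrees
SKELETON_MIN = 5   # minimum slot sum for a valid skeleton point


def angle_diff(a, b):
    d = abs(a - b) % 360
    return min(d, 360 - d)


def is_edge(mag, r, c):
    return 0 <= r < len(mag) and 0 <= c < len(mag[0]) and mag[r][c] != 0


def fits(mag, ang, r, c, expected):
    """True if (r, c) is an edge point with a compatible direction."""
    return is_edge(mag, r, c) and angle_diff(ang[r][c], expected) <= TOLERANCE


def propagate(mag, ang, total):
    """total[r][c] holds the slot sum of each non-edge point (0 elsewhere)."""
    rows, cols = len(mag), len(mag[0])
    out = [row[:] for row in total]  # read from total, write to out

    def raise_to(r, c, value):
        out[r][c] = max(out[r][c], value)

    for r in range(rows):
        for c in range(cols):
            s = total[r][c]
            if mag[r][c] != 0 or s < SKELETON_MIN:
                continue  # not a valid skeleton point
            for k in range(8):
                dx, dy = A_OFFSETS[k]
                ar, ac = r - dy, c + dx
                br, bc = r - 2 * dy, c + 2 * dx
                cx, cy = C_OFFSETS[k]
                if fits(mag, ang, ar, ac, 45 * k):
                    raise_to(ar, ac, s)          # step 1, type A
                elif not is_edge(mag, ar, ac) and fits(mag, ang, br, bc, 45 * k):
                    raise_to(ar, ac, s)          # step 2, type A ...
                    raise_to(br, bc, s)          # ... and type B
                if fits(mag, ang, r - cy, c + cx, 45 * k + 23):
                    raise_to(r - cy, c + cx, s)  # step 1, type C
    return out
```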

When  both  steps  of  the  process  have  been  com¬ 
pleted,  each  compact  region  will  contain  a  set  of 
high  scores,  as  will  the  edge  points  surrounding 
the  region  (Figure  2b).  These  points  define  the 
extent  of  the  region  at  the  particular  level  in  the 
pyramid.  To  extract  the  corresponding  region  in 
the original image requires the use of both the
gray-level and the edge pyramids. The particular
scoring function used does not have any special
significance, and it is likely that other functions
would  perform  equally  well. 

Note  that  no  thresholding  was  used  to  discard 
edges  with  very  low  magnitudes.  It  is  sometimes 
useful  to  keep  only  the  strong  edges,  and  so  avoid 
extracting  objects  with  very  low  contrast  that  are 
invisible  to  the  human  eye.  To  a  large  extent  the 
loss  of  resolution  at  higher  levels  of  the  pyramid 
achieves  this  automatically,  but  it  is  true  that  at 
low  levels  in  the  pyramid  a  lot  of  small  noise  re¬ 
gions  might  be  extracted.  Examples  of  the  improved 
performance  resulting  from  thresholding  the  edge 
magnitudes  are  shown  in  Section  4. 
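Thresholding of the edge magnitudes is simple to express. The sketch below is illustrative only: the function name `threshold_edges` is hypothetical, and treating a magnitude equal to the threshold as "kept" is an assumption, since the text does not say whether the comparison is strict.

```python
def threshold_edges(mag, threshold):
    """Zero out edge magnitudes below `threshold`, keeping the rest unchanged."""
    return [[m if m >= threshold else 0 for m in row] for row in mag]
```

The thresholded array can be passed to the scoring stage in place of the raw magnitudes, so that weak edges no longer contribute to any slot.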

The surroundedness pyramid has no links between
the levels. As a result, compact objects can
be detected at more than one level of the pyramid.
In previous work (Hong et al., 1981) links were
established, and the object was detected at the
highest level at which it was well defined. Such
a process would probably work for the current pyramid
structure as well.

3. Extracting the compact regions

The  most  obvious  way  of  extracting  compact  re¬ 
gions  from  a  given  level  in  the  surroundedness 
pyramid is first to find all points that have a
high  surroundedness  score.  These  points  can  then 
be  projected  down  to  the  base  level  by  finding  the 
corresponding  points  in  the  gray-level  pyramid  and 
following  their  links.  Unfortunately,  this  simple 
process  results  in  regions  that  are  displaced, 
misshapen,  and  which  have  holes  and  protrusions 
that  do  not  appear  in  the  original  objects. 

There  are  a  number  of  reasons  for  these  imper¬ 
fections,  analysis  of  which  leads  to  a  more  complex 
extraction  process,  but  one  that  produces  regions 
that  are  much  closer  to  the  actual  shapes  of  the 
objects.  The  flaws  can  arise  from  a  poor  initial 
segmentation  in  the  gray-level  pyramid  and  from 
displaced  or  missing  edges  in  the  edge  pyramid. 

Poor  edge  data  also  lead  to  incorrect  surrounded¬ 
ness  information,  and  this  also  must  be  improved. 

The compact region developed from the edge
information can be incorrect for two reasons.
First, the edges could be misplaced due to the
averaging  in  the  pyramid  process  and  the  non- 
maximum  suppression  applied  at  each  level.  Second, 
there  may  be  missing  or  noisy  edges.  To  correct 
the  placement  of  the  points,  use  is  made  of  infor¬ 
mation  from  the  gray-level  pyramid.  If  the  object 
is  known  to  have  a  particular  color,  then  all 
points  with  that  color  that  are  in  the  compact  re¬ 
gion  (i.e.  have  a  high  surroundedness  score)  can  be 
called  object  points,  and  the  rest  can  be  ignored. 
Alternatively,  if  the  object  is  known  to  be,  say, 
the  brightest  region  in  the  image,  the  gray  value 
of  the  brightest  node  of  the  2  by  2  level  in  the 
gray-level  pyramid  can  be  projected  down  and  inter¬ 
sected  with  the  compact  points  to  give  a  more  accu¬ 
rate  compact  region.  Usually,  however,  the  rela¬ 
tive  brightness  of  the  object  is  not  known,  or  the 
object  may  have  more  than  one  color,  so  that  a  more 
conservative  approach  has  to  be  taken.  This  in¬ 
volves  a  local  process  to  identify  the  set  of  gray 
values that occur most commonly in the interior of
the compact region. These values are taken as
representing the object, and neighboring points
with the same gray values are added to the starting
set to give a new compact region whose position is
more  accurate  because  it  is  derived  both  from  edge- 
and  region-based  properties. 


To correct for missing edges, the structure of
the edge pyramid is used. As the resolution of the
pyramid increases towards its base, the positions of
the edges become more and more accurate, but the
gaps become larger and larger. By fitting lines
through existing edge points in a top-down process,
the gaps can be filled in relatively cheaply, and
should approximate the actual contours of the boundary
more and more closely as the resolution increases.

Another problem that arises from using edge
information is that holes can appear inside object
regions because of noise in the image. For most
applications that were implemented, no edge magnitude
thresholding was performed, and for all applications,
no thresholding was performed above the
base level of the pyramid. As a result, edge points
with very low magnitude often appear in the interior
of objects. Again, by taking advantage of the
pyramid process, it is possible to remove these
edges and so ensure that holes do not appear inside
the objects. A characteristic of noisy edges is
that they do not survive as the resolution of the
image is reduced at successive pyramid levels. By
examining the sons of interior points and deleting
those that are edge points, the interior of the
region can be cleaned up. Of course, it is possible
for holes that are real features to be eliminated
in this way, and it is likely that edges with magnitudes
above some threshold should be retained.

The final difficulty of using naive projection
to find the compact regions is that the objects
that are found have misshapen boundaries. In some
places, the boundary might extend into the background,
while in others it might not extend out to
the actual border. It is even possible for the
simple projection process to give rise to disjoint
regions at the base of the pyramid. This problem
is overcome by projecting down the gray values
level by level, and using the edges at successive
levels to delimit the borders.

The process of extracting regions involves a
simultaneous addition and deletion of nodes in the
gray-level pyramid, guided by the edge and surroundedness
pyramids. Nodes are added if they are on
the interior side of an edge and adjacent to a
compact point. They are deleted if they are on
the outside of an edge belonging to the compact
object. The additions and deletions are performed
top-down at each level of the pyramid below the
level at which the compact object was discovered.
The result is a region whose outline closely follows
the edge bounding the object, and which is tolerant
of gaps in the edge information. This is similar
to the process described by Strong and Rosenfeld
(1973), but occurs vertically across levels of the
pyramid, instead of horizontally within a level.

In more detail, the process is as follows.

1. Project the gray values from the top (2 by
2) level of the pyramid down to the level
at which the compact object was discovered
(i.e., the level at which it received an
above-threshold surroundedness score).
Call this level L.


2.  Choose  points  to  be  considered  as  part  of 
the  object  from  among  the  points  belonging 
to  the  compact  region  as  follows.  For 
every  point  x  in  level  L  that  is  an  inter¬ 
ior  point  (i.e.,  has  a  high  score  and  is 
not  an  edge  point) ,  examine  the  surround¬ 
ing  5  by  5  window.  If  x  has  the  same 
value  as  the  majority  of  its  neighbors, 
then  x  is  considered  a  valid  object  point. 
This  ensures  that  points  that  have  gray 
values  that  belong  to  the  background,  or 
have  a  mixture  of  the  region  and  background 
colors, are not included.


3.  Expand  the  set  of  points  belonging  to  the 
compact  object  by  again  looking  at  5  by  5 
neighborhoods,  this  time  for  all  points  x 
at  level  L,  regardless  of  whether  they  are 
interior points or not. If x has neighbors
in the 5 by 5 region that were chosen
as  object  points  in  the  previous  step, 
then  x  is  marked  as  an  object  point  if  x 
has  the  same  gray  value  as  one  of  those 
points.  This  compensates  for  shifts  in 
the  edge  positions  due  to  the  pyramid 
process  and  the  non-maximum  suppression. 

4. Project the nodes in the enlarged compact
region down one level in the gray-level
pyramid, to level L-1.

5. Examine interior points of the compact
region in the edge pyramid at level L. If
any of the central four sons of an interior
point are edge points, delete them. This
cleans out noisy edge points in the interior
of the object at level L-1.

6. At level L-1, expand the compact region
by examining edge points that link all the
way to level L. If these edge points have
interior neighbors that are not part of
the region, add them in regardless of
their gray value. This expands the region
to fit the boundary at the current
level.

7. Fit lines through the edge points of 7 by
7 neighborhoods at level L-1 (see below).
Delete points that lie outside these lines
if they are part of the compact region.
This ensures that the region does not grow
outside the edge boundary, and prevents
leaks where no edges exist.

8.  Repeat  steps  4-7  for  levels  L-2,  L-3,... 
until  the  bottom  of  the  pyramid  is  reached. 
At  this  stage,  the  compact  region  has  been 
extracted. 
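Steps 2 and 3 of the procedure above (choosing seed points by a majority test, then expanding to nearby points with matching gray values) can be sketched as follows. This is a minimal illustration, not the original program. The names `window`, `choose_seeds`, and `expand` are hypothetical, and two details are assumptions: "majority" is taken to mean strictly more than half of the in-bounds neighbors in the 5 by 5 window, and gray values are compared for exact equality.

```python
def window(r, c, rows, cols, radius=2):
    """Yield the in-bounds neighbors of (r, c) in a 5 by 5 window, excluding (r, c)."""
    for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
        for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
            if (rr, cc) != (r, c):
                yield rr, cc


def choose_seeds(gray, interior):
    """Step 2: keep interior points whose gray value matches most of their window."""
    rows, cols = len(gray), len(gray[0])
    seeds = set()
    for (r, c) in interior:
        nbrs = list(window(r, c, rows, cols))
        same = sum(1 for rr, cc in nbrs if gray[rr][cc] == gray[r][c])
        if 2 * same > len(nbrs):
            seeds.add((r, c))
    return seeds


def expand(gray, seeds):
    """Step 3: add any point that has a seed of the same gray value in its window."""
    rows, cols = len(gray), len(gray[0])
    region = set(seeds)
    for r in range(rows):
        for c in range(cols):
            for rr, cc in window(r, c, rows, cols):
                if (rr, cc) in seeds and gray[rr][cc] == gray[r][c]:
                    region.add((r, c))
                    break
    return region
```

Here `interior` is the set of (row, column) positions at level L that have a high surroundedness score and are not edge points. Note that `expand` visits every point at the level, not only interior ones, which is what allows the region to shift when the edges are slightly displaced.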

Lines  are  fitted  to  edge  points  below  level  L 
to  fill  in  gaps  in  the  edges.  For  every  edge  point 
x  that  links  to  the  border  of  an  object  at  level  L, 
a  set  of  points  (e.g.,  those  marked  a  in  Figure  4 
and  their  rotations)  is  examined  if  x  satisfies  the 
following  conditions. 

1.  x  must  not  be  surrounded  by  interior 
points.  This  assumes  that  the  objects  do 
not  have  holes  in  them,  and  can  be  relaxed 
if  necessary. 

2. There is no edge parallel to x in the area
marked by a's in Figure 4. This is because
the parallel edge will prune the
region  and,  since  edge  magnitudes  were  not 
used,  the  outermost  edge  is  considered  the 
real  edge  at  the  current  level. 

If  both  conditions  are  satisfied,  all  the 
points marked a are pruned. In the implementation,
points were only deleted if they did not link to
some compact object. This was because all compact
regions were being extracted simultaneously, and it
was  possible  for  points  from  a  different  object  to 
appear  in  the  neighborhood,  especially  at  high 
levels  in  the  pyramid. 

The  reason  for  projecting  the  values  from  the 
2  by  2  level  to  level  L  and  using  the  set  of  points 
that  have  the  most  common  gray  values  is  to  allevi¬ 
ate  effects  that  the  edge  construction  process  has 
on  the  position  of  edges.  Assume  that  an  object  is 
represented  mostly  by  a  single  gray  value  in  the 
original  image,  and  that  this  consistency  is  pre¬ 
served  at  all  levels  of  the  pyramid.  Then,  so  long 
as the edges do not shift too far, the intersection
of the compact region and the set of points with
the most common gray values is a good seed for
growing the region. Adding in points that are immediate
neighbors of the seed points and that have
the same gray values ensures that the region is
shifted appropriately. It does not matter too much
if  the  corresponding  region  at  the  bottom  of  the 
pyramid  is  too  large,  because  the  pruning  that 
takes  place  at  lower  levels  will  make  sure  that  the 
region  stays  within  the  boundaries  defined  by  the 
edges.  Note  that  the  shift  in  the  edges  is  great¬ 
est  at  the  top  of  the  pyramid,  and  becomes  less 
and  less  as  the  base  level  (the  original  image)  is 
approached.  Because  of  the  links  between  levels, 
the shifting is not particularly important. Every
projection  follows  the  links,  both  in  the  gray- 
level  and  the  edge  pyramids,  so  that  the  size  and 
position  of  the  region  converges  to  the  true  size 
and  position  of  the  corresponding  object  as  the 
base  of  the  pyramid  is  approached. 

Note  that  no  threshold  was  applied  to  the  edge 
magnitudes,  so  that  many  weak  edges  remain  at  each 
level.  Most  of  these  do  not  form  links  to  the  next 
level,  or,  at  least  do  not  survive  as  the  size  of 
the region that contains them shrinks. The noise-
cleaning step examines interior (i.e., non-edge)
points at one level and deletes any of their central
2 by 2 sons that are edge points. This step
can sometimes cause interior detail to be lost.
For example, in the image of Figure 5, the central
dark region is filled in. Usually, however, the
process ensures that there are no holes in the
final object.
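The noise-cleaning step can be sketched as follows. The indexing is an assumption: the central 2 by 2 sons of node (i, j) at one level are taken to be the nodes (2i, 2j) through (2i+1, 2j+1) at the level below. The function name `clean_interior_edges` and the optional `keep_above` parameter are hypothetical. The parameter illustrates the suggestion that edges with magnitudes above some threshold should probably be retained, so that real holes are not filled in.

```python
def clean_interior_edges(interior_above, edge_mag_below, keep_above=None):
    """Delete edge points among the central 2 by 2 sons of interior points.

    interior_above: iterable of (i, j) interior (non-edge) nodes at level L.
    edge_mag_below: edge magnitudes at level L-1 (list of lists).
    keep_above: if given, edges stronger than this value are left in place.
    Returns a cleaned copy; the input array is not modified.
    """
    rows, cols = len(edge_mag_below), len(edge_mag_below[0])
    cleaned = [row[:] for row in edge_mag_below]
    for (i, j) in interior_above:
        for di in (0, 1):
            for dj in (0, 1):
                r, c = 2 * i + di, 2 * j + dj
                if r >= rows or c >= cols:
                    continue  # son falls outside the lower level
                if keep_above is not None and cleaned[r][c] > keep_above:
                    continue  # strong edge: treat as a real feature
                cleaned[r][c] = 0
    return cleaned
```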

The  step  of  expanding  the  region  to  conform 
with  the  edge  data  accounts  both  for  the  fact  that 
the  gray-level  pyramid  might  not  match  the  edge 
pyramid  exactly  and  for  the  possibility  that  the 
gray  values  of  the  object  might  not  be  uniform. 

Many  objects  exhibit  a  smooth  transition  with  the 
background.  By  expanding  the  region,  guided  by 
the  edges,  it  is  possible  to  account  for  variations 
in  gray  values. 

Region  splitting  is  applied  for  similar  rea¬ 
sons.  If  there  is  no  change  in  gray  values  between 
the  object  and  the  background  in  the  gray-level 
pyramid,  then  many  points  outside  the  object  will 
be  linked  to  nodes  that  are  interior  nodes  at  a 
higher level in the pyramid. When the gray values
are so similar, it often happens that no edges are
found at the corresponding positions on the edge
pyramid. By the nature of the pyramid, however, a
missing segment becomes smaller and smaller as the
height of the pyramid increases. By interpolating
across small breaks at each level, a close approximation
to the actual boundary can be obtained.
This interpolation is done by fitting lines through
the  edges  at  each  level.  All  nodes  that  lie  out¬ 
side  these  lines  are  pruned,  while  those  inside 
are  added  to  the  object.  As  the  resolution  in¬ 
creases  down  the  pyramid,  the  fitting  process 
approximates  the  object  boundary  more  and  more 
closely. 

4.  Examples 

The  procedure  was  applied  to  a  set  of  FLIR 
images  and  to  a  picture  of  a  number  of  chromosomes 
of  varying  sizes.  On  the  whole,  the  results  were 
very  satisfactory,  although  the  method  is  less 
successful  when  the  objects  are  so  small  as  to 
appear  only  in  the  original  image.  In  these  cases, 
there  is  no  smoothing  effect  from  the  pyramids  and, 
because  there  is  no  thresholding  of  the  edge  mag¬ 
nitudes,  a  number  of  small  noise  regions  are  ex¬ 
tracted  together  with  the  desired  regions. 

Figure  6  shows  a  very  clean  example  of  the 
system's  abilities.  The  original  image  consists 
of  a  number  of  chromosomes  against  a  dark  back¬ 
ground.  There  are  no  objects  so  small  as  to  be 
visible  only  at  the  full-resolution  level  of  the 
pyramid,  and  the  objects  are  spread  out  by  size 
across  the  next  two  levels.  Thus,  the  smaller 
chromosomes  appear  in  Figure  6a,  while  the  larger 
chromosomes  appear  in  Figure  6b.  The  larger 
chromosomes are also extracted in Figure 6a because
their surroundedness score is high enough at
this level. The process mentioned earlier of
choosing  the  best  level  at  which  to  extract  an  ob¬ 
ject  would  enable  the  larger  chromosomes  to  be 
extracted  only  at  the  higher  level. 

It  should  be  realized  that  each  chromosome  is 
extracted  individually,  even  though  the  gray-level 
pyramid  links  them  into  a  single  top-level  node. 

The  chromosomes  are  extracted  cleanly,  despite  the 
gaps  in  edge  information  evident  in  the  edge 
images (Figure 2), and despite the fact that one
small  chromosome  is  totally  lost  in  the  background 
of  the  gray-level  pyramid. 

Figure  7  shows  an  example  of  the  expansion  of 
the  gray-level  pyramid  region  to  fit  the  edge  image. 
In  the  upper  left  image,  the  original  grayscale 
tank  merges  fairly  smoothly  with  the  background. 
This  results  in  an  original  compact  region  smaller 
than  the  actual  tank  (bottom  left).  The  compact 
object  was  actually  found  at  the  8  by  8  level  of 
the  pyramid,  and  the  bottom  right  image  shows  the 
results  of  adding  in  points  on  the  inside  of  the 
edge  data  at  the  16  by  16,  32  by  32,  and  64  by 
64 levels of the pyramid. The result is a region
whose  shape  is  a  close  approximation  to  the  shape 
of  the  actual  object. 

Figure  8  shows  an  example  where  parts  of  the 
region  outside  the  object  are  discarded  by  the 
pruning step. A node was removed because it was on
the wrong side of the region boundary, resulting in a
more accurate outline. In fact, such pruning
happens in almost all the images.


Figure 9 shows what happens when the objects
being  sought  are  too  small.  If  an  object  is  not 
large  enough  to  be  represented  at  a  level  above  the 
original  image,  the  only  filtering  taking  place  is 
due  to  the  surroundedness  scoring.  It  is  possible 
for  a  single  noise  point  to  give  rise  to  a  compact 
region,  and  this  would  be  detected  in  addition  to 
any  legitimate  targets.  Noise  cleaning  at  this 
level  eliminates  many  of  the  detected  objects,  but 
can  remove  the  desired  objects  as  well.  By  thres¬ 
holding  the  edge  magnitudes,  however,  a  much  better 
result  can  be  obtained.  A  similar  improvement 
could  be  expected  if  the  surroundedness  scoring 
took  the  edge  magnitudes  into  account.  Even  without 
any  thresholding,  the  number  of  regions  detected 
is  still  less  than  that  for  the  gray-level  linking 
based  segmentation.  On  these  same  images  objects 
that  are  large  enough  to  survive  even  to  the  first 
level  above  the  original  image  are  detected  with 
almost  no  background  clutter.  Figures  10a  to  10s 
show  the  results  obtained  when  the  edge  magnitude  is 
thresholded (at 15). Figures 11a to 11i, 12a to
12f,  and  13a  to  13p  show  the  objects  extracted  at 
successively  higher  levels  without  edge  magnitude 
thresholding. 

In  the  segmentation  study  of  Hartley  et  al. 
(1981),  the  gray-level  linking  method  of  segmenta¬ 
tion  performed  reasonably  well,  except  for  the 
detection  of  a  large  number  of  unwanted  objects 
(false  alarms).  The  current  method,  being  based  on 
the  gray-level  linking  process,  is  guaranteed  to  do 
no  worse  than  that  method.  In  fact,  the  results 
show that the method significantly reduces the number
of false alarms, and often eliminates them
entirely.  The  method  can  also  be  tuned  to  detect 
objects  of  a  particular  range  of  sizes,  and  does  so 
with  no  extra  processing.  If  the  method  were  to  be 
ranked using the scoring function of Hartley et al.,
it would rank ahead of all the methods they tested
(Table I). Note that most of the images in the
study  had  to  be  sampled  down  to  64  by  64  pixels  be¬ 
cause  that  is  the  largest  size  the  program  can 
handle.  For  those  images  for  which  sampling  was 
not  necessary  (11-30),  the  method  performed  better 
than  the  others.  Overall,  the  method  was  as  good 
at  detecting  targets  as  the  best  method  in  that 
study,  but  had  a  lower  false  alarm  rate,  and  no 
extra  detections.  The  method  would  probably  per¬ 
form  even  better  if  it  were  re-implemented  to  han¬ 
dle  full-resolution  images. 

5.  Discussion  and  Conclusions 

A  method  has  been  presented  that  extracts  com¬ 
pact  objects  from  images.  The  method  uses  three 
kinds  of  pyramid-based  representations.  The  first 
is  a  gray-level  pyramid,  with  links  between  points 
at  successive  levels.  The  second  is  a  pyramid  of 
edge  information  for  each  level,  and  the  third  is 
a  surroundedness  pyramid  that  reflects  the  com¬ 
pactness  of  regions  at  each  level. 

The results of applying the method to a number
of images indicate that it is successful in extracting
compact objects so long as they are large
enough  to  survive  at  least  to  the  second  level  of 
the  pyramid.  The  extracted  objects  have  borders 
that closely follow the outlines in the original
scene, as found by the edge detector, and very few
extraneous  regions  are  usually  detected.  Even  in 
the  cases  in  which  the  objects  are  very  small,  they 
are  still  usually  extracted,  although  a  number  of 
unwanted  regions  might  also  be  extracted.  By 
thresholding  the  edge  magnitudes  of  the  original 
image,  most  of  the  unwanted  regions  can  be  dis¬ 
carded,  leaving  only  the  compact  objects.  It  can 
also  be  seen  that  the  process  extracts  only  com¬ 
pact  regions.  For  example,  the  road  in  Figure  14 
is  not  extracted,  because  it  is  elongated  rather 
than  compact. 

Levine  (1980)  discussed  a  pyramid-based  al¬ 
gorithm  for  region  analysis  that  is  related  to  the 
approach  presented  in  this  paper.  He  made  use  of 
three  color  pyramids,  a  texture  pyramid,  and  an 
edge  pyramid.  None  of  the  pyramids  were  construct¬ 
ed  using  overlapping  regions,  and  the  edge  pyramid 
was  formed  by  ORing  4  by  4  regions  of  an  original 
edge  image  to  produce  the  successive  levels.  The 
aim  of  the  research  was  not  to  extract  objects 
with  particular  shapes,  but  to  segment  a  scene  into 
regions.  Processing  involved  finding  points  as  far 
away  from  the  borders  of  regions  as  possible,  by 
finding the levels in the edge pyramid above which
a  set  of  edges  disappeared.  These  points  then 
served  as  seeds  for  growing  regions  by  projection 
in  the  pyramids.  At  each  level,  the  boundaries 
between  regions  were  refined  by  a  close  examination 
of  the  neighboring  points.  When  the  final  projec¬ 
tion  was  completed,  a  clean-up  process  was  used  to 
merge  small  regions  with  adjacent  larger  regions. 
The  method  proposed  in  this  paper  makes  more  use 
of  local  gray  values  in  the  analysis,  and  does  not 
need  to  perform  any  postprocessing  of  the  image. 

Earlier  work  has  also  concerned  the  problem 
of filling in regions from broken edge information.
Strong  and  Rosenfeld  (1973)  describe  an  iterative 
procedure  that  simultaneously  grows  regions  and 
fills  in  gaps  in  the  borders.  The  method  described 
here  has  advantages  in  that  the  speed  with  which 
regions  can  be  filled  in  is  significantly  greater 
in  the  pyramid,  as  is  the  distance  over  which  gaps 
in  the  edges  can  be  bridged. 

Danker  and  Rosenfeld  (1979)  examined  the  use 
of  pyramids  to  speed  up  the  propagation  of  edge 
and  region  labels  in  their  relaxation  scheme  for 
extracting  regions,  but  their  results  were  incon¬ 
clusive.  Given  the  ability  to  perform  operations 
in  parallel,  the  current  method  can  be  made  very 
efficient.  The  pyramids  are  all  constructed  in 
one  pass,  although  the  gray-value  pyramid  linking 
process is iterated. Later processing involves a
single  pass  through  the  pyramid,  starting  at  the 
level  at  which  the  compact  object  is  found,  and 
ending  at  the  level  of  the  original  image.  All 
processing  within  and  across  levels  is  local  in 
nature,  so  that  the  potential  exists  for  real-time 
implementation of the algorithm. To make the results
comparable with the study of Hartley et al.,
the gray-level linking process was iterated. It is
not clear that this is necessary because the process
does not depend on having regions with uniform
colors. 


It  would  be  of  interest  to  extend  this  work 
by  devising  scoring  functions  to  detect  elongated 
objects,  for  example,  or  objects  of  arbitrary 
shape.  With  a  small  set  of  primitive  shape  rec¬ 
ognizers  it  would  be  possible  to  build  a  powerful 
system  that  could  selectively  extract  objects  hav¬ 
ing  a  wide  variety  of  shapes. 
Table I. Summary of results for the comparative segmentation study.

Figure 1. Top: a gray-level
pyramid for a chromosome image.
Bottom: the results of iterating
the gray-level linking process
(10 iterations).




B3  C2  B2  C1  B1
C3  A3  A2  A1  C0
B4  A4  x   A0  B0
C4  A5  A6  A7  C7
B5  C5  B6  C6  B7

Figure  3.  The  5  by  5  neighborhood 
for  computing  surroundedness 
scores.  The  numbers  denote  chain- 
code  directions. 
Figure 4. The 7 by 7 neighborhoods used to
fit lines through edge points. The arrow indicates
the direction of the edge point x, and
the a's indicate the region that is examined.
Rotations  of  these  patterns  are  used  for  other 
edge  directions. 


Figure  5.  Top  left:  the  original 
FLIR  image  of  an  armored  personnel 
carrier. Top right: the edge image
projected down from the level at
which the compact object was found
(8 by 8). Bottom left: the compact
object found at level 3 (8 by 8)
without deleting interior edges.
Bottom  right:  the  result  of  apply¬ 
ing  the  whole  process  to  the  image. 
The  hole  in  the  middle  has  been 
filled  in. 
Figure 6. a) The chromosomes extracted at
level 1 (32 by 32), the first
level above the original image.


b)  The  chromosomes  extracted  at 
level  2  (16  by  16). 


Figure  7.  Top  left:  original 
FLIR  image  of  a  tank.  Top  right: 
edge  image  projected  from  the  8  by 
8  level.  Bottom  left:  the  com¬ 
pact  object  found  at  the  8  by  8 
level.  Bottom  right:  the  results 
of  adding  points  to  fit  the  edge 
data. 


Figure  8.  Top  left:  original  FLIR 
image.  Top  right:  edge  image  pro¬ 
jected  from  the  8  by  8  level.  Bot¬ 
tom  left:  compact  object  without 
pruning.  Bottom  right:  compact 
object  after  pruning. 
Figure 12. a-f. The results of running the process on images where the objects are
extracted  at  level  2  (16  by  16). 


Figure 14. Part of a suburban scene with
a  road  and  a  house.  The  house  is  compact 
enough  to  be  extracted,  but  the  road  is 
not. 
ABSTRACT

This paper describes a method of image segmentation
that creates a partition of the image
into compact, homogeneous regions using a parallel,
iterative approach that does not require immediate
forced choices. The approach makes use of a
"pyramid" of successively reduced-resolution versions
of the image. It defines link strengths
between pairs of pixels at successive levels of
this pyramid, based on proximity and similarity,
and iteratively recomputes the pixel values and
adjusts the link strengths. After a few iterations,
the link strengths stabilize, and the links
that remain strong define a set of subtrees of the
pyramid. Each such tree represents a compact
(piece of a) homogeneous region in the image; the
leaves of the subtree are the pixels in the region,
and the size of the region depends on how high the
root of the tree lies in the pyramid. Thus the
trees define a partition of the image into (pieces
of) homogeneous regions.

1.  Introduction 


Most of the existing methods of image segmentation
[1,2] are based on forced-choice decisions.
In methods that classify pixels into subpopulations,
we must decide to which class each pixel
belongs. In methods that partition the image into
homogeneous regions using splitting and merging
processes, we must decide, for each current region,
whether to split it, or whether to merge it with a
neighboring region (and if so, with which one).
This forced-choice aspect of segmentation is
undesirable, since many of the decisions may be
wrong, particularly when they are made on the basis
of very little information, and it is difficult
to undo the effects of wrong decisions.

In segmentation by pixel classification, a
"relaxation" approach [3] can be used to defer the
classification decisions until more information is
available. In this approach we compute a degree
of membership for each pixel in each class, or a
"probability" that it belongs to each class; and
we then iteratively adjust these membership values
based on the values at neighboring pixels and the
compatibilities of the various possible combinations
of class memberships of pairs of neighbors.
After a few iterations, the membership values
stabilize, with some values becoming or remaining
relatively high and others becoming very low, so
that it becomes easy to make the final classification
decisions.

Segmentation by partitioning into homogeneous
regions - e.g., regions of approximately constant
value - is generally more powerful than segmentation
by pixel classification, because the information
on which it is based is computed over regions
rather than ("myopically") over small neighborhoods
of pixels. Thus it would be desirable to develop
a region-based segmentation scheme in which decisions
are not made immediately. This paper
defines such a scheme and gives examples of the
results obtained when it is applied to various
types of images. Section 2 describes the general
principles of this scheme and compares it with
some related approaches; Section 3 discusses the
algorithm; and Section 4 presents experimental
results.

The support of the Defense Advanced Research
Projects Agency and the U.S. Army Night Vision
Laboratory under Contract DAAG-53-76C-0138 (DARPA
Order 3206) is gratefully acknowledged, as is the
help of Clara Robertson in preparing this paper.


2.  Weighted  pyramid  linking 

Our approach to unforced image partitioning
makes use of a "pyramid" of successively reduced-resolution
versions of the given image, say of
sizes $2^n$ by $2^n$, $2^{n-1}$ by $2^{n-1}$, ..., 2 by 2. The base
of the pyramid (level 0) is the input image, and
each successive level is constructed by averaging
4 by 4 blocks of pixels on the level below, where
the blocks overlap 50% in x and in y. (For convenience,
each level is regarded as cyclically
closed, so that its top row is adjacent to its
bottom row and its left column to its right column.)
Thus each pixel on a given level has 16 "sons" on
the level below (if any) that contribute to its
average, and 4 "fathers" on the level above (if
any) to whose average it contributes. This type of
pyramid has also been used for segmentation purposes
by other investigators; e.g., see the work of
Hanson and Riseman described in [4].
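
The father/son geometry just described can be made concrete with a short sketch. The block alignment used here (father (I, J) covering rows 2I-1 through 2I+2 of the level below, and likewise for columns) and the function names are assumptions made for illustration; the text fixes only the 4 by 4 block size, the 50% overlap, and the cyclic closure.

```python
def sons(I, J, n_below):
    """Return the 16 sons of father (I, J).

    n_below is the side length of the level below. The 4 by 4 block is
    assumed to start one pixel before (2I, 2J) and wraps cyclically.
    """
    return [((2 * I + di) % n_below, (2 * J + dj) % n_below)
            for di in range(-1, 3) for dj in range(-1, 3)]


def fathers(i, j, n_below):
    """Return the 4 fathers of pixel (i, j) on the level above."""
    n_above = n_below // 2
    # Under the assumed alignment, pixel i lies in the blocks of
    # fathers (i - 1) // 2 and (i - 1) // 2 + 1 (taken cyclically).
    rows = [((i - 1) // 2) % n_above, ((i - 1) // 2 + 1) % n_above]
    cols = [((j - 1) // 2) % n_above, ((j - 1) // 2 + 1) % n_above]
    return [(I, J) for I in rows for J in cols]
```

With this alignment every father/son relation is symmetric: a pixel appears among the sons of exactly those four pixels that `fathers` returns for it.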

The basic idea in our approach is to define
link strengths between "neighboring" pixels (i.e.,
father/son pairs) on adjacent levels of the pyramid,
based on the similarity (in value) and proximity
(in (x,y) coordinates) of each such pair. We
then recompute the pixel values (at the levels
above the base) as weighted averages of their sons'
values, where the weights depend on the link
strengths; these new values define new link
strengths, and the process is iterated. (The details
of the algorithm will be given in the next
section.) After a few iterations, the link
strengths stabilize, and the links that remain
strong define subtrees of the pyramid. As it turns
out, each such tree defines a compact homogeneous
region in the image, where the leaves of the tree
are the pixels belonging to the region, and the
height of the tree corresponds to the region size
(the larger the region, the higher the root of its
tree lies in the pyramid). Thus the weighted links
can be used to define a partition of the image into
compact homogeneous regions. Note that this partition
is not defined immediately, but only after
the link weights have stabilized.


To see intuitively why this approach should
work, consider the case of a homogeneous compact
region on a homogeneous background. Pixels in the
interior of the region (or background) will link
strongly to all their fathers, since these fathers'
values are averages of image blocks that lie in
the same region. A pixel near the region border,
however, will link more strongly to a father that
lies inside the region than to one that lies partly
outside, since it is more similar in value to the
former. Thus when we recompute the fathers'
values, a father whose image block lies mostly inside
the region will get closer in value to the
average of the region, since it is more strongly
linked to its sons that lie in the region than to
those that lie in the background; and conversely.
This makes its links to the former sons even stronger,
and to the latter even weaker, so that the link
strengths and values should converge. Now consider
a pixel whose block lies mostly inside the region,
but whose fathers' blocks all lie mostly outside,
because they are bigger than the region. By the
argument just given, the pixel's value should tend
toward the region average, while its fathers'
values should tend toward the background average, so
that the pixel does not remain strongly linked to
any of its fathers, and becomes the root of a tree
representing (a compact portion of) the region.

It is of interest to compare this approach to
some earlier segmentation schemes based on pyramid
linking or on link strengths. In [5-6] link
strengths are computed between each father/son pair,
but we keep only the strongest of the four links
between a pixel and its fathers. We then recompute
the pixel values allowing only those sons that are
linked to a pixel to contribute to its value; recompute
the link strengths based on these new
values; and iterate the process. Note that in this
scheme every pixel must link to one of its fathers;
thus the links define precisely four trees, rooted
at the top (2x2) level, so that the image is segmented
into precisely four sets of pixels. These
sets do not correspond to compact regions, but do
tend to correspond to homogeneous subpopulations of
pixels. Thus the segmentation scheme of [1-2] is
more like a pixel clustering and classification
scheme than an image partitioning scheme; and it
also makes forced choices immediately, since it
keeps only the strongest upward link from each
pixel. Extensions of this scheme to segmentation
based on color or texture, and to waveform or contour
segmentation, are described in [7-10].

A pyramid linking method which does make use
of all the link strengths, rather than discarding
all but the strongest upward link, is described in
[11]. However, in this method the link strengths
are normalized so that they sum to 1; thus here
too the links are forced to extend upward from
every pixel (divided among its fathers appropriately)
all the way to the top level. In fact,
the link strengths tend to converge to 0 or 1
when the process is iterated, so that this method
too defines a segmentation of the image into four
subpopulations of pixels, rather than a partition
into regions.


A  weighted  pixel  linking  scheme  not  involving 
a  pyramid  is  described  in  [12].  Here  a  link 
strength  is  computed  for  each  pair  of  neighboring 
pixels  based  on  their  closeness  in  value.  The 
image  is  then  smoothed  by  replacing  each  pixel  with 
the  average  of  its  neighbors,  weighted  by  their 
link  strengths.  Using  these  new  values,  the  link 
strengths  are  recomputed,  and  the  process  is 
iterated. This tends to produce a very high-quality
smoothing, and the links that remain
strong  could  be  used  to  define  a  segmentation  of 
the  image  into  homogeneous  regions;  but  this 
method  would  not  always  be  reliable,  since  it  is 
based  on  small  neighborhoods.  The  method  defined 
in  this  paper  is  analogous  to  the  scheme  in  [12], 
but  using  "vertical"  links  (between  fathers  and 
sons)  in  a  pyramid,  rather  than  "horizontal"  links 
(between brothers) in an image at a single resolution.
Our method could be generalized to make use
of  horizontal  as  well  as  vertical  link  strengths, 
but  we  shall  not  pursue  this  possibility  here. 

3.  The  algorithm 

The algorithm is initialized, as mentioned
earlier, by building the pyramid using unweighted
averaging of 4x4 blocks that overlap 50% horizontally
and vertically. Alternatively, we could use
nonoverlapping 2x2 blocks (for the initialization
only; a pixel still has 16 sons in the subsequent
steps), or we could use the median instead of the
mean; but these variations were found to make
little difference in the results.
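
A minimal sketch of this initialization follows, assuming a square image whose side is a power of two, stored as a list of rows. The block alignment (each father's 4 by 4 block starting one pixel before position (2I, 2J)) and the names `reduce_level` and `build_pyramid` are illustrative assumptions, not details given in the text.

```python
def reduce_level(img):
    """Build the next level by unweighted averaging of 4x4 blocks.

    Blocks overlap 50% in each direction and wrap around cyclically.
    """
    n = len(img)
    m = n // 2
    out = [[0.0] * m for _ in range(m)]
    for I in range(m):
        for J in range(m):
            total = 0.0
            for di in range(-1, 3):
                for dj in range(-1, 3):
                    total += img[(2 * I + di) % n][(2 * J + dj) % n]
            out[I][J] = total / 16.0
    return out


def build_pyramid(image):
    """Return the list of levels, from the input image up to 2x2."""
    levels = [[[float(x) for x in row] for row in image]]
    while len(levels[-1]) > 2:
        levels.append(reduce_level(levels[-1]))
    return levels
```

Because every pixel contributes equally to four fathers, the mean gray level is the same on every level of the initial pyramid.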

Let v(P) denote the value of pixel P in the
pyramid, say on level i. Initially, if i = 0 this
is the gray level of an input pixel, and if
i > 0 it is the mean of the values of P's 16 sons.
Let $\sigma(P)$ be the standard deviation of these sons'
values (or if i = 0, we take $\sigma$ to be a constant; we
used 5 in our experiments).

Let P* be one of the fathers of P. The link
strength between P and P* is defined by

$$w(P,P^*) = \bigl(1 + d(P,P^*)\bigr)^{-1} \cdot \frac{1}{\sqrt{2\pi}\,\sigma(P)} \exp\!\left(-\frac{|v(P)-v(P^*)|^2}{2\,\sigma(P)^2}\right)$$
trees by using only their most strongly linked
fathers.


In this expression, the first factor depends on the
distance between (the centers of) P and P*; d is
taken to be 3 for the closest father, 1 for the
farthest, and $\sqrt{5}$ for the other two. (It can be
verified that these are proportional to the Euclidean
distances between the centers.) This factor
makes the sets of pixels that belong to a given
tree more compact; if it is omitted, these sets
become more irregular in shape. The factors involving
$\sigma(P)$ reflect the (non)variability of the sons of
P; if they are highly variable, P does not link
strongly to any of its fathers. Finally, the exp
factor depends on the similarity in value of P and
P*; if they are very dissimilar, the link is weak.
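
The three factors can be sketched as a single function. This is only an illustration: it assumes the link strength is a distance factor 1/(1 + d) times a Gaussian density in the value difference with standard deviation sigma(P), which is one reading of the printed formula. The function name is hypothetical, and d is passed in by the caller rather than computed from the geometry.

```python
import math


def link_strength(v_son, sigma_son, v_father, d):
    """Link strength between a pixel and one of its fathers.

    Assumed form: (1 + d)^-1 * N(v_son - v_father; 0, sigma_son^2),
    where sigma_son must be positive.
    """
    diff = v_son - v_father
    gauss = math.exp(-diff * diff / (2.0 * sigma_son ** 2))
    gauss /= math.sqrt(2.0 * math.pi) * sigma_son
    return gauss / (1.0 + d)
```

Under this form the strength falls off when the father is dissimilar in value, when the distance term d is larger, and when the sons of P are highly variable and the value difference is small relative to sigma.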


We now want to recompute the pixel values at
levels > 0 as weighted averages of their sons'
values, where the weights depend on the link
strengths. Note first that the weight given to a
son must also depend on the (weighted) "area" of
the image represented by that son; for example,
if one son had unit strength links (down through
successive levels) to a single image pixel, and
zero strengths to all its other descendants, we
would not want to give it as much weight as a son
that had high-strength links to many image pixels.
Let a(P) be the "area" of pixel P; initially, for
a pixel at level $\ell$, we have $a(P) = 2^{2\ell}$, since P
represents a $2^{\ell}$ by $2^{\ell}$ image block. Subsequently,
let a(P') be the area of a son P' of P, and let
w(P',P) be the link strength between them. Then

$$a(P) = \sum_{P'} \frac{w(P',P)\,a(P')}{W(P')}$$

(where the sum is over the sons P' of P); here

$$W(P') = \sum_{P'^*} w(P',P'^*)$$

(the sum being over the fathers P'* of P'). Note that in
computing a(P) we are actually using normalized
weights, i.e., $\sum_{P'^*} w(P',P'^*)/W(P') = 1$. This is because
it seems reasonable that the "area" of a pixel
should be distributed among its fathers in a
normalized fashion, in order to ensure that the
total "area" of all pixels at a given level
remains equal to the area of the image.
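
The area bookkeeping can be sketched as follows. The data layout (each son given as a tuple of its link strength to this father, its own area, and its total link strength to all four fathers) and the function names are assumptions for illustration, as is the choice to skip a son whose total upward link strength is zero.

```python
def initial_area(level):
    """Area of a pixel before any relinking: a block of side 2**level."""
    return 4 ** level


def update_area(sons):
    """Recompute the area of a father from its sons.

    sons: iterable of (w, a, W) where w is the son's link strength to
    this father, a is the son's area, and W is the sum of the son's
    link strengths to all of its fathers.
    """
    return sum(w * a / W for (w, a, W) in sons if W > 0.0)
```

For example, a son of area 4 with link strengths 0.5, 1.5, 0 and 2 to its four fathers contributes 0.5, 1.5, 0 and 2 to their areas, so its area is shared out in full.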


4. Experiments

The  algorithm  just  described  was  applied  to  the 
three  images  shown  in  Figure  1:  photomicrographs 
of  some  chromosomes  (right)  and  blood  cells  (left), 
and  an  infrared  image  of  a  tank.  Each  image  is 
64x64  pixels;  thus  the  top  (2x2)  level  of  the 
pyramid is level 5. At each iteration, the gray
level  displayed  for  each  pixel  is  the  value  at  the 
root  of  its  tree.  We  see  that  even  after  a  single 
iteration,  the  trees  define  a  decomposition  of  the 
image  into  regions  having  a  small  set  of  values; 
and  in  one  or  two  more  iterations  the  set  of  values 
is  reduced  even  further. 

Table  1  lists  the  root  nodes  at  each  level, 
and their values, for each image for as many iterations
as were needed until there was no further
change  in  the  set  of  roots.  We  see  that  the  more 
complex  the  image,  the  more  iterations  are  required 
for  the  set  of  roots  to  stabilize;  but  that  even 
for  the  most  complex  image,  the  changes  after  the 
first  two  or  three  iterations  have  little  effect 
on  the  segmentation  of  the  image. 

Figure  2  shows  printouts  of  the  displayed 
images  after  the  first  (parts  a-c)  and  last 
(parts  d-f)  iterations,  where  the  value  printed  in 
each  region  identifies  the  root  of  the  tree  to 
which  it  belongs;  the  digit  is  the  level,  and  the 
letters  are  used  to  distinguish  the  roots  at  that 
level. We see that after a few iterations, the
leaves of each tree define a small set of compact
regions.  As  Table  1  indicates,  regions  that  are 
compact  pieces  of  a  single  homogeneous  region  have 
nearly  the  same  value.  Note  that  because  of  the 
coordinate  wraparound,  regions  on  opposite  sides 
of  the  image  may  belong  to  the  same  tree. 


Finally, the new value of pixel P is given in
terms of its sons' values by

$$v(P) = \frac{\sum_{P'} v(P')\,a(P')\,w(P',P)}{\sum_{P'} a(P')\,w(P',P)}$$

where the sums are over the sons P' of P. Similarly,
the new standard deviation is given by

$$\sigma(P) = \sqrt{\frac{\sum_{P'} \bigl(v(P)-v(P')\bigr)^2\,a(P')\,w(P',P)}{\sum_{P'} a(P')\,w(P',P)}}$$

The process is iterated; in our experiments, only
two or three iterations were necessary.
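
A sketch of this update for a single father is given below. It assumes the standard deviation is the square root of the weighted mean squared deviation about the new value; the tuple layout, the function name, and the handling of a father with no effective sons (returning None) are illustrative choices not specified in the text.

```python
import math


def update_value_and_sigma(sons):
    """Recompute a father's value and standard deviation.

    sons: list of (v, a, w) giving each son's value, its area, and its
    link strength to this father. Returns (value, sigma), or None if
    all weights a * w are zero.
    """
    total = sum(a * w for (_, a, w) in sons)
    if total <= 0.0:
        return None
    value = sum(v * a * w for (v, a, w) in sons) / total
    var = sum((value - v) ** 2 * a * w for (v, a, w) in sons) / total
    return value, math.sqrt(var)
```

A son whose link to the father has decayed to zero no longer influences the father's value, which is how a father lying mostly inside a region drifts toward the region average.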

After the desired number of iterations, we
call a pixel a "root" if it is on the top level
(2x2), or if the sum of its link strengths to all
its fathers is negligible (in our experiments:
$< 10^{-5}$). The nonroot pixels are then assigned to
trees by using only their most strongly linked fathers.


5. Concluding remarks

We have exhibited a method of segmenting an
image into compact homogeneous regions by constructing
links between "neighboring" pixels at
consecutive levels of a "pyramid".

An  important  feature  of  this  method  is  that 
each  region  is  represented  by  a  tree  having  the 
pixels  of  the  region  as  leaves.  The  height  of  this 
tree  is  proportional  to  the  log  of  the  region  size. 
Thus,  even  for  large  regions,  all  the  pixels  in  the 
region  are  relatively  closely  linked  to  the  root  of 
the  tree,  and  hence  to  each  other.  The  pyramid 
structure makes it possible for information to
propagate  between  different  parts  of  a  region 
relatively  rapidly.  Moreover,  the  root  of  the  tree 
can  be  used  as  a  node  to  represent  the  region  in 
various  region-level  relational  structures.  Thus 
the  tree  constitutes  a  transition  between  the 
pixel-level  representation  of  the  region  and  more 
abstract  representations. 

Another important feature of our method is
that the trees are produced by a cooperative process
in which link strengths are iteratively
adjusted. Under this process, root pixels
representing regions become easy to recognize, because
their link strengths to their fathers all
become negligible. They are harder to recognize
in the original pyramid, where the pixels (especially
at higher levels) represent mixtures of
image pixels, so that the link strengths are not
initially negligible.

Image  processing  and  segmentation  techniques 
based  on  "local"  operations  performed  in  a  pyramid 
can  be  implemented  very  rapidly  in  parallel  on  a 
tree-structured cellular processor [13]. It is
possible that processes of this type also play a
role in biological visual systems, where the input
image is represented at a range of resolutions
[14].
Figure 1. a) Input images and their histograms

b) Results after one iteration

c) Results after last iteration
Table 1. Root nodes and their values at each iteration for the three images. The
root labels (see Figure 2) are given only for the first and last iterations.

(a) Cell image; there were no changes in the set of root nodes after the third iteration.

(b) Tank image; no changes after the second iteration.

Note that one of the roots is a single pixel.

(c) Chromosome image; no changes after the eighth iteration.

[Table body not recoverable. Columns: Iteration, Level, Root label, Coordinates, Value.]

ABSTRACT


After a review of shape-from-shading basics, we describe
experiments in constraint interaction. Boundary conditions can
interact significantly to influence computed shape. Interleaved,
cooperating shape-from-shading and illuminant-direction calculations
show constraint interaction in underdetermined computation of
physical parameter images.

I.  Background  and  Overview 

The goal of a shape from shading algorithm is to derive 3-D
shape information from a 2-D image in which the value at a point
(the image irradiance) is proportional to the intensity of light
reflected by a small foreshortened area of opaque surface (the scene
radiance). Shape is a physical property of an object. In this work,
shape refers to the mathematical specification of the surface - its
height as a function of two-dimensional position, say. We define the
term parameter image to mean an image-like array of physical
parameters, perhaps expressed in viewer-centered coordinates. We
use intrinsic image [Barrow and Tenenbaum 1978] to mean a
parameter image expressed in object-centered coordinates. Shape, by
almost any qualitative or quantitative definition, is determined by
the large-scale variation of surface orientation over the object. This
macroscopic variation is linked with the infinitesimal behavior of the
surface - its local orientation. Knowledge of local orientation is
enough to allow the surface function to be reconstructed (up to a
constant depth offset). Such an array of orientations is a shape
parameter image; an array of heights is another.

One of our goals is to derive an orientation image from an input
intensity image. The computation of an array of vector orientations
from an array of scalar intensities is in general underdetermined.
Usually the computation is made possible through assumptions about
the continuity of physical objects (surface smoothness) and about the
imaging situation (the composition and placement of the illuminant
and the reflectivity of the objects in the scene), taken with boundary
conditions (a set of points at which orientation is known). Under
these conditions the variation of intensity (the shading) in the image
can indeed yield shape. This is one of the most influential ideas in
modern computer vision [Horn 1970, 1975, 1977; Woodham 1978;
Strat 1979; Brooks 1979; Barrow and Tenenbaum 1981; Ikeuchi and
Horn 1981]. Good discussions of various approaches appear in [Strat
1979; Ikeuchi and Horn 1981; Grimson 1981]. Modern approaches
(e.g. [Ikeuchi and Horn 1981]) use parallel-iterative techniques to
minimize error terms arising from the smoothness and reflectance
constraints. Grimson [1981] has studied algorithms for surface
interpolation, deriving conditions for surface consistency functionals
and investigating convergence properties. Bruss [1981] has studied the
mathematical properties of the irradiance equation, which lies at the
base of the shape-from-shading enterprise.

The human visual system seems capable of extracting shape with
less explicit information than shape from shading mathematics needs.
In particular, it may be unnecessary to know the imaging geometry
(the illuminant direction) a priori. One factor here may well be the


[...]

image  calculations  are  mutually  constraining,  and  that  they  can 
converge  to  consistent  results  despite  seeming  indeterminacy. 

Our concern is with the interaction of parameter image
computations, which is a large and interesting area. Depth
information is known to influence lightness computations, for
example [Gilchrist 1979]. Rigid body motion and shape can be
computed from optical flow [Ballard and Kimball 1982]. One
experiment we describe has to do with concurrent computations of
illuminant direction and shape. Our input is an intensity image and a
reflectance function for the surface, but no information about
imaging geometry (such as a reflectance map or the illuminant
direction). Our strategy is to interleave a standard relaxation
computation for shape with a step that adjusts the light source
position to be consistent with the developing shape results and the
intensity data. The partial results should constrain each other, and
the research is to determine how.

Unfortunately, the theory of first order partial differential
equations guarantees (loosely) that given an intensity image, then for
any reflectance map there is a solution (a 2-D height function of x,y)
passing through any well behaved strip (a curve whose height is a
function of a curve in the x,y plane) that does not happen to be a
characteristic strip (located along gradient maxima) [Bruss 1981]. Just
about any intensity image is consistent with just about any illuminant
direction.

Clearly then what makes the cooperating derivation of shape and
illuminant direction interesting is the interaction, through an
algorithm, of constraints on shape, on reflectance, and on
orientations and intensities known a priori. We have constructed a
flexible computational laboratory to explore algorithms and the
interactions of such constraints in them. We give a brief review of
orientation space parameterization in Section 2, reflectance functions
in Section 3, surface smoothness measures in Section 4, relevant
shape-deriving techniques in Section 5, and parameter transforms in
Section 6. Section 7 describes the implementation of the
algorithm for shape and illuminant direction determination. Section 8
gives experimental results.

2.  Orientation  Spaces 

The representation and manipulation of directions is basic to the
enterprise of deriving shape from shading. The orientation of a
surface at a point is determined by the direction of its surface
normal (a vector perpendicular to the surface) there. Sections
2.1 - 2.4 briefly present useful orientation representations. Ikeuchi
and Horn [1981] give a good brief treatment of spherical projections.

2.1  Polar  Space  (Gaussian  Sphere) 

The direction of a vector specifies a direction in space. If the
vector is based at the origin, its components (the polar space
coordinates of a point) are the direction numbers of the line from
the origin through the point. If the vector is of unit length, then its
Cartesian components (x,y,z) are direction cosines, and indicate a
point on the unit (Gaussian) sphere, which is often taken to
represent 3-D directions [Gauss 1965]. The points (0,0,1) and (0,0,-1)
are respectively the z and -z poles of the Gaussian sphere.

This redundant three-dimensional (x,y,z) representation of
directions has the advantage of intuitive clarity, good interpolation
properties [Barrow and Tenenbaum 1781], and a straightforward
geometrical interpretation [Brown 1979].

2.2 Gradient Space: Slant and Tilt

If a surface is represented as a function f(x,y) of two dimensions,
then its gradients, or partial derivatives (p,q) at a point represent the
orientation of the surface at that point.

$$p = \partial f / \partial x$$

$$q = \partial f / \partial y$$

Gradient  space  is  the  projection  of  half  the  Gaussian  sphere  onto 
a  plane.  The  plane  of  gradient  space  is  tangent  to  the  Gaussian 
sphere  at  the  z  pole,  and  the  point  of  projection  is  the  origin  (center) 
of  the  sphere.  The  equator  is  projected  out  to  points  at  infinity  in 
(p,q)  space,  and  one  hemisphere  is  not  projected  at  all. 

The slant of a surface is the amount of its inclination away from
the z axis (along which a viewer is assumed to look). Slant is taken
to be zero when the surface is perpendicular to the z axis (its normal
is parallel to the z-axis), and infinity when it is parallel to the z-axis
(its normal is perpendicular to the z-axis). The slant of an orientation in
(p,q) gradient space is its distance from the origin; all orientations on
a circle centered at the origin have the same slant. Thus points
around the equator of the Gaussian sphere represent orientations
with infinite slant.

The tilt of a surface is the direction of its slant. In gradient space,
the tilt of a surface with gradients (p,q) is measured by the angle
between a fixed direction (perhaps the p axis) and the line joining
the origin and (p,q). Tilt and slant are simply polar coordinates of
gradient space.
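
As a minimal sketch of the polar-coordinate reading of gradient space, the following Python converts a gradient (p,q) to slant and tilt and back. The function names are illustrative, and measuring tilt from the p axis is an assumption (the text only says "perhaps the p axis").

```python
import math


def slant_tilt(p, q):
    """Return (slant, tilt) for a gradient (p, q).

    slant: distance of (p, q) from the gradient-space origin.
    tilt:  angle in radians from the p axis (assumed reference direction)
           to the line joining the origin and (p, q).
    """
    return math.hypot(p, q), math.atan2(q, p)


def gradient_from_slant_tilt(slant, tilt):
    """Inverse mapping: polar coordinates back to the gradient (p, q)."""
    return slant * math.cos(tilt), slant * math.sin(tilt)


slant, tilt = slant_tilt(3.0, 4.0)
print(slant, math.degrees(tilt))
```

All gradients on a circle about the origin share one slant, so only the tilt distinguishes them.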

Gradient  space  has  been  a  popular  tool  in  computer  vision  for 
some time [Horn 1970, 1975, 1977; Mackworth 1973; Huffman 1978].
Its advantages come from its dual relation to image space and its
mathematical relation to the surface height function. (A 2-D array of
(p,q) vectors may be integrated directly to recover f(x,y) to within a
constant depth displacement.) However, it only represents half the
space of orientations, it has a singularity at an important angle (that
normal to the z-axis viewing direction), and of course it does not
lend itself to linear interpolation for directions. To convert from
gradient space to polar space, use

$$(x, y, z) = (p, -q, 1) / \sqrt{p^2 + q^2 + 1}.$$

2.3  Stereographic  Space 

If the Gaussian sphere is projected, again onto a plane tangent to
its z pole, with the -z pole as the point of projection, the plane has
orientations represented in (f,g) stereographic space. Points on the
equator (surfaces of infinite slant) project to a circle of radius two,
and the whole Gaussian sphere is represented on the infinite (f,g)
plane. The main advantage of stereographic space is that it represents
more orientations than does gradient space and only has a singularity
at a single angle. It yields complicated analytic expressions, and again
linear interpolation is not exact [Ikeuchi and Horn 1981]. To convert
from gradient space to stereographic space, use

$$f = 2p\,[\sqrt{p^2 + q^2 + 1} - 1] / (p^2 + q^2)$$

$$g = 2q\,[\sqrt{p^2 + q^2 + 1} - 1] / (p^2 + q^2).$$
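
The gradient-to-stereographic conversion can be sketched in Python as below. This is an illustration rather than the authors' code: it uses the algebraically equivalent form 2p / (1 + sqrt(1 + p^2 + q^2)) of the standard stereographic projection, which avoids the 0/0 at p = q = 0, and the function names are hypothetical.

```python
import math


def gradient_to_stereographic(p, q):
    """Map a gradient (p, q) to stereographic coordinates (f, g).

    With s = sqrt(1 + p^2 + q^2), 2p / (1 + s) equals 2p (s - 1) / (p^2 + q^2)
    but stays finite and well-defined at the origin.
    """
    s = math.sqrt(1.0 + p * p + q * q)
    scale = 2.0 / (1.0 + s)
    return scale * p, scale * q


def stereographic_to_gradient(f, g):
    """Inverse map, valid only inside the radius-2 circle (finite slant)."""
    d = 4.0 - (f * f + g * g)
    return 4.0 * f / d, 4.0 * g / d


print(gradient_to_stereographic(0.0, 0.0))   # z pole maps to the origin
print(gradient_to_stereographic(1.0e6, 0.0)) # near-infinite slant approaches radius 2
```

The two prints illustrate the properties stated above: the z pole lands at the origin and orientations of very large slant approach the circle of radius two.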

2.4  Spherical  Space  (Longitude  and  Co-latitude) 

This  orientation  space  is  familiar  from  geography  and  spherical 
coordinates. A point on the Gaussian sphere is identified by $(\theta, \phi)$,
its co-latitude ($\pi/2$ - latitude) and longitude. The equator of the
Gaussian sphere is at $\theta = \pi/2$. To convert from gradient space to
spherical space, use

$$\theta = \arctan(\sqrt{p^2 + q^2})$$

$$\phi = \arctan(p/q).$$


3. Reflectance

Generally,  the  apparent  brightness  of  a  surface  under 
illumination  depends  on  the  surface  and  on  the  geometry  of  the 
viewing  situation.  This  dependence  is  captured  in  the  Bi-directional 
Reflectance Function (BDRF) [Nicodemus et al. 1977]. The BDRF
for any surface is a function of the incidence, emittance (viewing),
and phase angles of light on a surface patch, which themselves are
defined relative to the surface normal. Horn and Sjoberg [1978]
explore the BDRF and derive simplified forms of it from physical
first principles.


When the imaging geometry is fixed (orthographic projection,
fixed viewer, fixed object of fixed BDRF, and fixed light sources),
the reflectance properties of a surface become a function just of its
orientation with respect to the viewer. This dependence is made
explicit in a reflectance map

R: Orientation --> Intensity,

which assigns an image brightness to each surface orientation.

The irradiance equation expresses the relation between image
intensity and surface orientation:

$$I(i,j) = R(\mathrm{Orientation}(i,j)). \qquad (3.1)$$

The  reflectance  map  is  a  powerful  tool.  Looked  at  as  a  contour 
map of brightnesses, it restricts the possible orientations at point (i,j)
to the (usually one-dimensional) iso-brightness contour of value I(i,j).
The simplest and most common reflectance function is the
Lambertian, or matte-finish. A planar Lambertian surface looks
equally bright from any direction (any angle of emittance),
depending only on the angle of incidence of light. This is because
the amount of light emitted by a small patch of surface varies as
the cosine of the angle of emittance, while the amount of surface
subtended by a solid angle increases as the reciprocal of the cosine
of the angle of emittance. Thus the contribution of the emittance
angle is cancelled, and the perceived brightness of the surface
depends only on the angle of incidence. The computational
simplicity of the Lambertian reflectance function makes it a favorite
of computer graphics and computer vision, but the understanding of
reflectance functions is becoming quite sophisticated [Cook and
Torrance 1981; Whitted 1980].

4.  Smoothness 

One  of  the  most  useful  constraints  on  object  shape  (unless  their 
geometric form is known [Woodham 1978a]) is the assumption that
object surfaces are "smooth". Discontinuous variation in surface
normals then only occurs between objects, not within objects, and
thus is a relatively rare event. Ultimately, the smoothness constraint
is incorporated into an error term that penalizes non-smooth surfaces
and encourages smooth ones. Thus the choice of technical definition
for smoothness affects the performance of shape-from-shading
algorithms (Section 5.4). Barrow and Tenenbaum [1981] present
useful smoothness measures for both surfaces and space curves.
Ikeuchi and Horn [1981] present a good discussion of these issues.

Brooks [1979] and Strat [1979] independently developed
algorithms  based  on  a  smoothness  criterion  involving  the  existence 
and  continuity  of  second  partial  derivatives  of  surface  height  z.  This 
criterion  seems  overly  stringent,  and  its  application  is  dependent  on 
the  gradient  space  representation  of  orientations. 

Ikeuchi and Horn [1979] use a criterion whose application is
representation-independent. A smooth surface is a $C^0$ and $C^1$
function: its depth has no discontinuities, nor do its first partial
derivatives.  Because  the  map  from  image  coordinates  to  surface 
orientation  is  continuous,  this  criterion  can  be  used  for  any 
continuous  orientation  space. 

With this criterion a natural smoothness error term is (up to a scale
factor) the sum of four squared differences of components of
neighboring points in orientation space.

$$s_{ij} = \tfrac{1}{4}\,[(f_{i+1,j} - f_{ij})^2 + (f_{i,j+1} - f_{ij})^2 + (g_{i+1,j} - g_{ij})^2 + (g_{i,j+1} - g_{ij})^2] \qquad (4.1)$$

The analytical minimization of the $s_{ij}$ error admits of solution by a
relaxation algorithm (Section 5.4). The resulting linear interpolation
of orientations is appealing and mathematically justified, and we use
it in the work reported here. However, the very simple form of the
iterative equations (eq. (5.4.4)) invites the substitution of nonlinear
techniques to incorporate other versions of the smoothness
constraint.
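
To make the smoothness error concrete, here is a small Python sketch that evaluates it on a grid of two orientation components. It is an illustration only: the 1/4 scale factor follows the Ikeuchi-Horn convention and is an assumption here, as is the choice to skip differences that would fall outside the grid.

```python
def smoothness_error(f, g):
    """Return a grid of smoothness errors s_ij for components f and g.

    f, g: equally sized lists of rows. Each s_ij sums the squared forward
    differences to the (i+1, j) and (i, j+1) neighbors in both components,
    scaled by 1/4 (assumed convention). Out-of-grid differences are skipped.
    """
    rows, cols = len(f), len(f[0])
    s = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for comp in (f, g):
                if i + 1 < rows:
                    total += (comp[i + 1][j] - comp[i][j]) ** 2
                if j + 1 < cols:
                    total += (comp[i][j + 1] - comp[i][j]) ** 2
            s[i][j] = 0.25 * total
    return s


f = [[0.0, 1.0], [2.0, 3.0]]
g = [[0.0, 0.0], [0.0, 0.0]]
print(smoothness_error(f, g))
```

A constant orientation field gives zero error everywhere, which is the sense in which the term "encourages" smooth surfaces.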

5. Shape from Image

This  section  briefly  outlines  a  few  "shape  from..."  techniques  that 
are  relevant  to  our  work.  Good  discussions  of  these  techniques 
appear in [Strat 1979; Ikeuchi and Horn 1981].

5.1 Multiple Images -- Photometric Stereo

This technique [Woodham 1978b; Ikeuchi to appear] is very
simple in concept. For a given imaging geometry and image intensity
at a point (i,j), the irradiance equation (3.1) constrains the
orientations to the (graph of the) I(i,j) iso-brightness contour in the
reflectance map. Another imaging geometry similarly yields another
locus in another reflectance map. The orientation of the surface at
(i,j) must be consistent with both intensities; it lies in the intersection
of both loci. The easiest way to change the imaging geometry is to
move the light source; recall that changing the viewpoint will not
alter the brightness of Lambertian surfaces. Three equations are
needed to determine the reflectance and the unit normal, so we need
three light source positions at which point (i,j) is not in shadow. Let


each source position vector be denoted by $n_k$, $k = 1, \ldots, 3$, and
rewrite equation (3.1) as

$$I_k(i,j) = r\,(n_k \cdot n_s), \quad k = 1, 2, 3, \qquad (5.1.1)$$

or in matrix form as

$$I = r N n_s \qquad (5.1.2)$$

where

$$I = [I_1(i,j),\ I_2(i,j),\ I_3(i,j)]^T \qquad (5.1.3)$$

and

$$N = \begin{pmatrix} n_{11} & n_{12} & n_{13} \\ n_{21} & n_{22} & n_{23} \\ n_{31} & n_{32} & n_{33} \end{pmatrix} \qquad (5.1.4)$$

If the three illuminant directions $n_k$ are not coplanar, N is
nonsingular and we can solve for r and $n_s$, for example by

$$r = |N^{-1} I| \qquad (5.1.5)$$

$$n_s = (1/r)\, N^{-1} I. \qquad (5.1.6)$$


Figure 1 shows three source positions with a Lambertian surface
and gradient space orientation representation, and how the
iso-brightness contours of the three corresponding reflectance maps
intersect at a unique (p,q).
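
The photometric stereo solution for a single image point can be sketched as follows. This is a minimal standard-library illustration, not the cited authors' implementation: it solves the 3x3 system by Cramer's rule using cross products, and the function and variable names are hypothetical.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def photometric_stereo(sources, intensities):
    """Recover reflectance r and unit normal from three lit measurements.

    sources:     three illuminant direction vectors (the rows of N).
    intensities: the three brightnesses measured at one image point.
    Solves N x = I for x = r * normal, then splits x into length and direction.
    """
    n1, n2, n3 = sources
    det = dot(n1, cross(n2, n3))
    if abs(det) < 1e-12:
        raise ValueError("illuminant directions are coplanar; N is singular")
    c23, c31, c12 = cross(n2, n3), cross(n3, n1), cross(n1, n2)
    i1, i2, i3 = intensities
    x = tuple((i1 * c23[k] + i2 * c31[k] + i3 * c12[k]) / det for k in range(3))
    r = math.sqrt(dot(x, x))
    return r, tuple(c / r for c in x)


# Synthetic check: render three intensities from a known normal, then recover it.
root2 = math.sqrt(2.0)
sources = [(0.0, 0.0, 1.0), (1.0 / root2, 0.0, 1.0 / root2), (0.0, 1.0 / root2, 1.0 / root2)]
length = math.sqrt(0.2 ** 2 + 0.3 ** 2 + 1.0)
true_normal = (0.2 / length, -0.3 / length, 1.0 / length)
intensities = [0.8 * dot(s, true_normal) for s in sources]
print(photometric_stereo(sources, intensities))
```

The sketch assumes the point is lit by all three sources and has non-zero reflectance; shadowed measurements would violate the linear model.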

5.2  Local  Analysis  -  Parameters  from  Differential  Shading 

Local shading is an important source of shape parameters and
illuminant direction information. Local clues provide
constraints that can be used with considerable effect by global
relaxation algorithms.

The thesis of Pentland [1982a, 1982b] thoroughly explores this
use of local (differential) shading measures. Some (non-statistical)
conclusions from local measures are immediate, since no global
calculations like constraint propagation or relaxation are involved.
Other (statistical) conclusions can be reached by accumulating local
evidence, still not using iterative relaxation. They yield
maximum-likelihood estimates of imaging parameters.

Pentland  shows  how  to  determine  a  maximum-likelihood 
estimator for illuminant direction under certain reasonable
assumptions about the imaged domain, and compares the results of
his algorithm to human performance. Given the illuminant direction,
the differential shading measures can provide approximations to
surface curvature properties. The local determination of surface tilt
(direction of slant) means that normals are approximately constrained
to have only one degree of freedom - this can be quite important in
a global relaxation algorithm. The slant determination and the
illuminant direction calculation are statistical. These estimates are
obvious  choices  to  initialise  parameters  for  global  relaxation 
computations. 


5.3  Global  Relaxation  -  Shape  from  Line  Drawings 

Barrow and Tenenbaum [1981] investigate the reconstruction of
shape from outline drawings. They note that boundary-defining lines
in drawings arise from what they call extremal boundaries (which we
shall call smooth occluding boundaries) and discontinuity boundaries
(which we shall call sharp occluding boundaries) (Fig. 2).

Boundary lines contain surface normal information. Clearly the
line of sight just grazes (is orthogonal to the normal of) the surface
at a smooth occluding boundary, and the surface normal there is fully
determined, being orthogonal both to the line of sight and the
boundary contour. At a sharp occluding boundary, the normal is
constrained to lie in the plane normal to the (three-dimensional)
space curve of the sharp boundary. Object boundaries are an
important source of boundary conditions (known orientations) for
shape  from  shading  calculations  (Section  5.4). 

Given a line drawing, Barrow and Tenenbaum propose a process
to interpolate surfaces inside boundary contours. First the boundary
contours are identified and labelled as smooth or sharp. They are
projected into 3-space into the curve minimizing torsion and spatial
derivative of curvature (maximizing planarity and circularity). Thus
the surface normals near the boundary contours are known exactly
for smooth occluding boundary contours, or restricted to a plane for
sharp occluding boundary contours. A linear interpolation in polar
space is performed in the interior of a region bounded by the
contours, and thus even in the absence of shading a surface is
constructed. The scheme has the flavor of a relaxation algorithm
because the implementation interpolates using only local information
that must propagate around the region.

The surface reconstructed by interpolation is the correct one for
the sphere and cylinder, whose normals vary linearly with image
displacement. The scheme tries to produce surfaces with uniform
mean, principal, and Gaussian curvature. This surface reconstruction
without shading information is interesting as a pure case of global
relaxation unaffected by shading corrections. The fact that it can
work reasonably well by itself makes an approximately cylindrical or
spherical surface a poor choice to demonstrate the capabilities of a
shape-from-shading system.

5.4  Global  Relaxation  --  Shape  from  Shading 

Our work is based on the method of Ikeuchi and Horn [1981],
which  is  briefly  outlined  in  this  section. 

An important starting point for shape from shading calculations
is boundary conditions, which in this context are known values of
orientation at surface points. Without these, the irradiance equation
(3.1) has infinitely many solutions [Bruss 1981]. The constrained
values of orientation can arise at the boundary of the object (Section
5.3), or from specular or singular points that indicate an
unambiguous orientation. With a Lambertian reflectance function
and known light source position, the brightest non-specular points
(the singular points) in the image may be taken to have normals in
the light source direction. Simple geometric arguments yield the
orientation at specular points.

Given adequate boundary conditions, the goal is to find
orientations elsewhere on the surface that minimize the sum of two
error terms. One term, $r_{ij}$, measures the disagreement between the
intensity at a point I(i,j) (also written $I_{ij}$) and the intensity predicted
by the reflectance map (R(p(i,j),q(i,j)) or R(f(i,j),g(i,j)), say) for the
current value of the orientation at that point.

$$r_{ij} = (I_{ij} - R(f_{ij}, g_{ij}))^2 \qquad (5.4.1)$$

The other error term is $s_{ij}$ (eq. 4.1), measuring departure from
surface smoothness.

The total error e is thus

$$e = \sum_i \sum_j (s_{ij} + w\, r_{ij}) \qquad (5.4.2)$$

where w is a weight determining the relative importance of
smoothness and reflectance errors.

Ikeuchi and Horn find an analytic expression for the minimum
of the error e, and then construct a Gauss-Seidel-like, parallel
iterative relaxation algorithm to find orientations minimizing e. The
computation can proceed independently in the two orientation
parameters (in their case f and g, the stereographic coordinates), and
in each the derivative of the smoothness error term is simply
proportional to the difference of the component and the average of
its four neighbors. For f, with s the smoothness error summed over
the image,

$$\partial s / \partial f_{ij} = 2\,(f_{ij} - f^*_{ij}), \qquad f^*_{ij} = \tfrac{1}{4}\,(f_{i+1,j} + f_{i,j+1} + f_{i-1,j} + f_{i,j-1}) \qquad (5.4.3)$$

The relaxation algorithm computes a new $f_{ij}$ (and $g_{ij}$) in each
iteration; the new component is the average of the four-neighbors in
the last iteration, modified by a term expressing the shading
constraint. In terms of (f,g) space, the algorithm first initializes $f^0_{ij}$
and $g^0_{ij}$ to the apriori known orientations, and then on the (n+1)st
iteration,

$$f^{n+1}_{ij} = f^{*n}_{ij} + w\,[I_{ij} - R(f^{*n}_{ij}, g^{*n}_{ij})]\,\partial R / \partial f$$

$$g^{n+1}_{ij} = g^{*n}_{ij} + w\,[I_{ij} - R(f^{*n}_{ij}, g^{*n}_{ij})]\,\partial R / \partial g \qquad (5.4.4)$$

This relaxation algorithm finds surface orientations that minimize
a weighted sum of error measures, given an intensity image and a
reflectance map as input. The algorithm has many pretty features,
and much that is usable in other contexts. It has the limitation that
considerable apriori information about the imaging situation must
exist. 

6.  Parameter  Transforms 

The parameter transform is a generalization of the Hough
transform [Duda and Hart 1972; Ballard 1981; Brown 1982]. As such
it has well-known pleasant properties of robustness in noise and low
computational overhead. It is also of interest as a computational
mechanism that is implementable in connectionist architectures
[Feldman and Ballard 1981].

Parameter  transforms  can  occur  between  any  two  related  spaces. 
Let  a  parameter  (perhaps  read  off  an  input  parameter  image)  be  a 
vector (x, a(x)) in a space A, and another parameter (perhaps a
feature parameter derived from image space) be a vector b in a space
B. Then there is often a physical constraint F that relates a(x) and b
so that F(a,b) = 0. Sometimes F may be expressed analytically (as in
the Hough transform for parametric curves [Duda and Hart 1972]);
sometimes F is best expressed as a relation table (as in the GHough
method for nonparametric shape detection [Ballard 1981]).

A particular image may give rise to a set of values $\{a_k\}$ from
parameter space A, where $a_k = a(x_k)$. The set $\{a_k\}$ is only
consistent with certain elements in the space B allowed by the
relation F. The parameter transform is a way of computing which
elements those are and finding the most popular one. For each $a_k$
compute the set

$$B_k = \{\, b \mid a_k \text{ and } F(a_k, b) < \delta b \,\}. \qquad (6.1)$$

$B_k$ is the set of elements in space B that are consistent with $a_k$.
Define H(b) as the number of times the value element b occurs in
the union of all the sets $B_k$. H(b) is thus a histogram that counts the
number of "votes" collected by parameter b on the basis of evidence
provided by the $a_k$, and is the output of the parameter transform
from space A to space B for the image described by $\{a_k\}$. Usually A
and B are discrete, and $\delta b$ relates to the quantization in B. If H is
normalized, a normalized confidence measure that the feature with
parameter b is present in the image is

$$C(b) = H(b) / \sum_b H(b). \qquad (6.2)$$
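
The voting scheme of this section can be sketched generically in Python. The example below is a toy, not the authors' system: the observations are (x, y) points, the candidate parameter b is the slope of a line through the origin, and the constraint F(a, b) = y - b x is a hypothetical choice made only for illustration. The consistency test uses |F| < tolerance, which is an assumed reading of the quantization condition in (6.1).

```python
from collections import Counter


def parameter_transform(observations, candidates, constraint, tolerance):
    """Return H, a Counter mapping each candidate b to its number of votes.

    An observation a votes for every candidate b with
    |constraint(a, b)| < tolerance (assumed form of the consistency test).
    """
    votes = Counter()
    for a in observations:
        for b in candidates:
            if abs(constraint(a, b)) < tolerance:
                votes[b] += 1
    return votes


def confidence(votes):
    """Normalize the histogram H(b) into the confidence measure C(b)."""
    total = sum(votes.values())
    return {b: count / total for b, count in votes.items()} if total else {}


# Four points on y = 2x plus two outliers; candidate slopes 0.0, 0.5, ..., 4.0.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (1.0, 3.5), (2.0, 1.0)]
slopes = [0.5 * k for k in range(9)]
H = parameter_transform(points, slopes, lambda a, b: a[1] - b * a[0], 0.25)
print(H.most_common(1), confidence(H)[2.0])
```

Because the answer is the most popular cell (a mode), the two outliers add stray votes elsewhere without shifting the winning slope.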

7.  Implementation 

This  work  assumes  an  orthographic  image  projection,  with  a 
point illuminant at infinity. With a Lambertian reflectance function,
any constant distribution of illumination can be achieved by one
point source [Silver 1980; Pentland 1981]. We use the polar space
(Gaussian sphere) orientation representation, and have also tried the
spherical $(\theta, \phi)$ representation.


One of our experiments was to derive shape and illuminant
direction in parallel, using parameter transforms. Intuitions from
photometric stereo (Section 5.1) help in understanding the parameter
transform approach to calculating illuminant direction. In
photometric stereo, three known illuminant positions allow surface
normal determination. In a dual problem, by reversing the roles of
the vectors, three known surface normals allow determination of the
illuminant direction. An image intensity $I_{ij}$ at a point whose surface
normal is n = (x,y,z) is consistent with a set of illumination
directions forming a cone in Cartesian space. The cone's axis is n,
and the set includes all directions making the correct constant angle
with n. Rather than follow the analytic dual solution, we use a
parameter transform approach that does not involve simultaneous
equation solution.

In a stage of the illuminant direction calculation, surface points
with hypothesized directions each "vote" for the set of illuminant
directions with which they are consistent.

The  Lambertian  reflectivity  function  means  that  the  irradiance 
equation  is  simply 

$$I_{ij} = k\, n_{ij} \cdot L.$$

That is, the brightness at image location (i,j) is proportional to the
cosine of the angle between the normal n at (i,j) and the illuminant
direction L.

In the notation of Section 6, a is the intensity and surface normal
direction information, b is the illuminant direction, and F is the
angle condition.

$$a = [I_{ij},\ n_{ij}]$$

$$b = [L]$$

$$F = k\, n_{ij} \cdot L - I_{ij} = 0.$$
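
The voting just described, in which each surface point votes for every illuminant direction consistent with the angle condition, can be sketched as below. This is an illustration under stated assumptions: the accumulator cells come from a Fibonacci spiral on the sphere rather than a geodesic tessellation, k is taken as 1, the tolerance is arbitrary, and all names are hypothetical.

```python
import math
import random


def fibonacci_sphere(count):
    """Return `count` roughly evenly spaced unit vectors (accumulator cells)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    cells = []
    for i in range(count):
        z = 1.0 - 2.0 * (i + 0.5) / count
        ring = math.sqrt(1.0 - z * z)
        cells.append((ring * math.cos(golden_angle * i),
                      ring * math.sin(golden_angle * i), z))
    return cells


def vote_illuminant(samples, cells, k=1.0, tolerance=0.05):
    """Each (intensity, normal) sample casts one vote per consistent cell.

    A cell L is consistent when |k n.L - I| < tolerance; the cell with the
    most votes (the mode) is returned with the full vote list.
    """
    votes = [0] * len(cells)
    for intensity, (nx, ny, nz) in samples:
        for index, (lx, ly, lz) in enumerate(cells):
            if abs(k * (nx * lx + ny * ly + nz * lz) - intensity) < tolerance:
                votes[index] += 1
    best = max(range(len(cells)), key=votes.__getitem__)
    return cells[best], votes


def synthetic_samples(light, count, seed=0):
    """Noise-free Lambertian samples (k = 1) from random lit unit normals."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in v))
        normal = tuple(c / length for c in v)
        brightness = sum(a * b for a, b in zip(normal, light))
        if brightness > 0.1:
            samples.append((brightness, normal))
    return samples


norm = math.sqrt(0.3 ** 2 + 0.2 ** 2 + 0.9 ** 2)
true_light = (0.3 / norm, 0.2 / norm, 0.9 / norm)
estimate, votes = vote_illuminant(synthetic_samples(true_light, 150), fibonacci_sphere(2000))
print(estimate, max(votes))
```

Each sample's votes trace a ring of cells on the sphere (the cone of consistent directions), and the rings pile up near the true illuminant.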

In a parameter transform implementation, the votes are collected
in a discrete accumulator space of orientations. After all the voting,
the cell with the most votes is taken to indicate the illuminant
direction for the next round of shape calculation. Choosing the
direction with the most votes is equivalent to using a mode, rather
than a mean or least-squared error regression calculation. The
parameter transform thus is unaffected by noise until the noise
actually overwhelms the good data. The accumulator space for
illumination directions is a geodesic tessellation of the Gaussian
sphere [Brown 1979]; this construction partitions the sphere into
approximately congruent cells. We index the data structure by
treating each subdivided icosahedral face (there are twenty of them,
each with $N^2$ cells) as two triangular arrays, one for cells pointing
"up", one for cells pointing "down" (Figure 3).

8.  Experimental  Results 

8.1  Orientation  Spaces 

Eq. (5.4.4), though derived from least-squares minimization
arguments, has a simple geometric interpretation, pointed out in
[Ikeuchi and Horn 1981]. This is simply that in the iterative process,
the normal vector at the (n+1)st iteration is a weighted vector sum
of two other vectors. The first, incorporating the smoothness
constraint, is the sum of neighboring normal vectors (of the nth
iteration). The second, incorporating the irradiance equation
constraint, is a vector in the gradient direction of the reflectance as a
function of surface normal. Think of the new surface normal being
an average of old neighboring ones (for smoothness), tilted so as to
be better in agreement with the known radiance of the image.

The various projections of the Gaussian sphere onto the plane
require more or less complex expressions for the gradient, but in
polar orientation space it is easy to show that reflectance functions
depending only on the angle of incidence (such as Lambertian) have
their gradient in the direction of the illuminant. Let l be the unit
vector in the direction of the illuminant, n be the surface normal
unit vector, and i be the angle of incidence. Let
$R(i) = R_1(\cos i) = R_1(l \cdot n)$. Then

$$\mathrm{grad}\, R = (\partial R_1 / \partial \cos i)\; l.$$

Thus for any reflectance function that is circularly symmetric
about the illuminant direction, the shading adjustment in the
iteration will simply bend the vector toward or away from the
illuminant. We wondered how these geometric intuitions were
mirrored in algorithms using different parameterizations of
orientation, and whether there was some advantage to using one or
another orientation space parameterization.
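
The geometric picture above (average the neighboring normals, then bend the result toward or away from the illuminant) can be sketched in polar space as follows. This is a hedged illustration, not the authors' program: it assumes a Lambertian surface with unit albedo, a Jacobi-style sweep that reads only the previous iteration, and a hypothetical `weight` parameter scaling the shading correction.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v) if length > 0.0 else tuple(v)


def relax_step(normals, image, light, fixed, weight=0.2):
    """One relaxation sweep over a grid of unit normals in polar space.

    normals: list of rows of unit 3-vectors from the previous iteration.
    image:   list of rows of observed brightnesses.
    light:   unit vector toward the illuminant.
    fixed:   list of rows of booleans; True marks known orientations.
    """
    rows, cols = len(normals), len(normals[0])
    updated = [row[:] for row in normals]
    for i in range(rows):
        for j in range(cols):
            if fixed[i][j]:
                continue
            # Smoothness: vector sum of the in-grid four-neighbors, renormalized.
            acc = [0.0, 0.0, 0.0]
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols:
                    for k in range(3):
                        acc[k] += normals[a][b][k]
            avg = normalize(acc)
            # Shading: for a Lambertian surface the reflectance gradient points
            # along the illuminant, so the correction is a multiple of `light`.
            predicted = max(0.0, dot(avg, light))
            correction = weight * (image[i][j] - predicted)
            updated[i][j] = normalize(tuple(avg[k] + correction * light[k] for k in range(3)))
    return updated


light = normalize((0.3, 0.0, 1.0))
up = (0.0, 0.0, 1.0)
normals = [[up] * 3 for _ in range(3)]
normals[1][1] = normalize((0.5, 0.0, 1.0))  # perturbed interior normal
image = [[dot(up, light)] * 3 for _ in range(3)]
fixed = [[not (i == 1 and j == 1) for j in range(3)] for i in range(3)]
print(relax_step(normals, image, light, fixed)[1][1])
```

On this flat test patch the free interior normal is pulled back to the orientation of its fixed neighbors in a single sweep, since the averaged normal already satisfies the irradiance constraint.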

Some orientation space parameterizations have singularities and
infelicities such as modular indices (gradient and $(\theta, \phi)$ space, for
example). The process of "averaging neighbors" is usually taken to
operate componentwise. We found that to be painful in $(\theta, \phi)$ space,
and geometrically it seems an approximation to averaging in polar
space (i.e. just taking the average of unit vectors in 3-space). Thus
even when working in $(\theta, \phi)$ space we computed vector averages in
polar space. With that proviso, the convergence and accuracy of
relaxation schemes seemed unaffected by the orientation space
parameterization. However, we see little reason ever to work in
projections of the Gaussian sphere - polar space seems the natural
choice.

8.2  Step  Size 

The relaxation method dictates a shading adjustment to be
applied after the smoothness adjustment. There is always a value
(given or being iteratively adjusted) for the illuminant
direction. It thus is always possible to compute the tilt of the normal
vector that would satisfy the irradiance equation constraint exactly.
We wondered how big an adjustment in the direction of fastest
improvement should be made.

We  tried  applying  varying  amounts  of  the  shading  correction 
(from  all  of  it  to  about  one  tenth  of  it).  Unsurprisingly,  we  found 
that  applying  smaller  corrections  produced  results  of  higher  accuracy. 

8.3 Reflectance Functions

The nonlinearity of the Lambertian reflectance function (a
cosine) means that the computation of surface normal corrections is
more sensitive for incidence angles near zero, where the cosine is
relatively flat. A small difference in intensity can signal a relatively
large difference in surface normal. We conjecture that noise in the
intensity image (which for many common noise processes is
proportional to the intensity or its square root) could combine with
the properties of the Lambertian reflectance function to distort the
derived shape. To test this hypothesis, we constructed images arising
from the reflectance function that is linear in the incidence angle.
This reflectance function performs identically to the Lambertian in
noiseless conditions. Its performance in noisy conditions is a matter
for future research, and is mildly interesting because the Lambertian
reflectance function theoretically has higher noise resistance for low
angles of incidence but lower noise resistance for high angles of
incidence.

8.4 Constraint Interactions

We explored how the surface smoothness constraint interacted
with apriori knowledge and boundary constraint.

Surface:                 f(x,y) = sin(x), (-1.9 <= x <= 1.9).
Resolution:              f defined on 33 x 33 grid
Orientation Parameters:  polar
Smoothness Constraint:   average four-neighbors
Irradiance Constraint:   .2 (entire shading correction)
Apriori Knowledge:       1. orientations of entire boundary known
                         2. orientation known on opposite sides (two
                            cases)
Boundary Constraint:     initial orientations fixed, unknown free.

The surface has a maximum and minimum where the surface
normal changes sense in the x direction: all normals are in the x-z
plane. The results of these experiments are shown in Figures 4 and 5.

Figure 4a shows the input image (the illuminant is at the
viewpoint). When normals are known apriori for only the top and
bottom boundary (Fig. 4b), the normals are reconstructed
immediately (Fig. 4c) with no appreciable error. When the apriori
knowledge is of the left and right boundaries, the result (not
illustrated) is disastrous, with the smoothness constraint preventing
the normals from changing direction across the peak and trough. The
resulting surface fits the irradiance constraint well, but has serious
and permanent normal direction errors.

Figure 5 shows the case that normals are known apriori around
the entire boundary (Fig. 5a). After some time, an intermediate stage
is reached at which conflicting normal information has propagated in
from the sides and met along the diagonals. Here the averaging
process yields equivocal results, and the surface has severe infelicities
in both smoothness and irradiance (Fig. 5b, c). After a time the
smoothness constraint has won out, deriving the expected surface
(Fig. 5d, e, f). The irradiance constraint weight was important here:
with the irradiance constraint set to add 100% of the shading
correction, a stable point is reached at the situation of 5b and c. This
equilibrium is fragile: when at one point a program bug introduced a
minor asymmetry, convergence to the "expected" surface occurred.

8.5 Boundary, Illuminant, and Apriori Constraint

We wanted to explore the interaction of boundary constraint,
illuminant direction constraint, and apriori knowledge of boundary
orientation. To that end we ran a series of six experiments. For each
we used the following setup.

Surface:                 f(x,y) = cos(xy), (-2.1 <= x,y <= 2.1)
Resolution:              f defined on 33 x 33 grid
Orientation Parameters:  polar
Smoothness Constraint:   average four-neighbors
Irradiance Constraint:   approx. .2 (entire shading correction)
Hough Space:             order 7 icosahedral geodesic
                         tessellation (49*20 cells)
Hough Voting Scheme:     one vote per cell
Illuminant Direction:    known or unknown (two cases)
Apriori Constraint:      orientations known or unknown (two
                         cases)
Boundary Constraint:     known orientations fixed or
                         orientations constrained to lie in plane
                         (two cases)

Some explanation:


The  surface  is  not  radially  symmetric,  and  has  enough  variation 
in its normals to provide a challenge to the shape-from-shading
algorithm. As set forth in Section 7, we use a Hough transform
technique to attempt to derive illuminant direction in parallel with
the shape determination. With this technique, used in the unknown
illuminant cases, some of the already-computed surface normals
"vote" for the illuminant direction, in a dual of the photometric
stereo process. In Fig. 1, we can imagine each ellipse being a pattern
of votes: where the three ellipses intersect, there will be a peak of
votes, and that is the desired normal.

Similarly, in the illuminant direction computation, each normal
casts a circle of votes onto the Gaussian sphere, denoting all
illuminant directions at the angle with the normal that produces the
known intensity. There are many variations in how such votes can be
cast. In terms of [Brown and Curtiss 1982], this scheme is "jagged"
(not anti-aliased, one vote per cell). We have also implemented a
CHough scheme [Brown 1982]; in noiseless data it does not improve
the final results.


i 
The orientations of normals around the border are initially taken
either to be known exactly or to be smoothly related to each other
but not correctly known. In the latter case the corners are left in the
correct orientation and as x and y approach 0 the normals are
increasingly rotated in the plane normal to the boundary curve. The
rotation increases by 3 degrees per boundary element linearly toward
a maximum of 45 degrees.
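
As a small illustration of this twist schedule, the sketch below lists the rotation applied to each element along one side of the 33-element boundary. The indexing (corners at the two ends of the side, distance counted in elements from the nearest corner) is this sketch's reading of the description, and the name `boundary_twist_degrees` is hypothetical.

```python
def boundary_twist_degrees(side_length=33, step=3.0, max_twist=45.0):
    """Twist angle in degrees for each element along one boundary side.

    Corners (index 0 and side_length - 1) are left untwisted; the twist
    grows by `step` degrees per element away from the nearest corner and
    is capped at `max_twist`. Indexing and cap handling are assumptions.
    """
    last = side_length - 1
    return [min(step * min(i, last - i), max_twist) for i in range(side_length)]
```

On a 33-element side the cap is reached 15 elements from a corner, so only the few elements nearest the middle of each side carry the full 45-degree twist.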

The boundary constraint is either that the boundary normals are
fixed and correct or that they are constrained to rotate in a plane
normal to the boundary curve. The former constraint is the "smooth
occluding contour" constraint [Ikeuchi and Horn 1981], the latter the
"sharp occluding contour" constraint [Barrow and Tenenbaum 1981].
We did not run the case that the orientations are incorrect but fixed
(hence only six cases).

The cases in which the illuminant is known are comparable to
the usual shape-from-shading results (e.g. [Strat 1979, Ikeuchi and
Horn 1981]). A case in which the boundary is known and is
immediately used to calculate the illuminant direction is a Hough
implementation of solving the dual photometric stereo equations. It
is not interesting except as a test that the votes go to the right places.
After such a start, the situation is the same as if the illuminant were
known a priori.

Figure  6  shows  the  original  surface. 

Figure 7(a-g) shows Case 1, approximating the usual
shape-from-shading case with smooth occluding boundary, in which
illuminant direction is known and boundary normals are known and
fixed. In these runs, the illuminant is at $\theta$ = 19 or 20 degrees,
and $\phi$ is variously at 45, 90, and 135 degrees. 7a-d show the
initial, correct, and computed normals and the normal differences. 7e
shows the input image, 7f shows the image that would result from the
reconstructed surface, and 7g shows the reconstructed surface
integrated from its surface normals. In all the cases, the irradiance
errors are negligible, so the image-like figures will not be repeated.
The shading term is .2 (complete correction). The computed normals
are not perfect, being at their worst where they have to change sign
in x or y. This could result from the "looseness" of the sun in these
combined calculations, where illuminant position is influenced by the
surface calculations.

Fig. 8(a-d) shows Case 2, in which the illuminant direction and
boundary normals are both known correctly, but the normals are free
to rotate in the plane normal to the boundary curve (the sharp
occluding boundary case). The shading term is .^complete
correction). Fig. 8a is initial normals, 8b,c show normals after 25
iterations and in their final position. 8d shows the normal errors. The
normal convergence is often poor until the normals propagating in
from the boundaries meet in the center. This might be alleviated by
increasing the shading weight to overcome the tendency of normals
to stay at the slant of the boundary normals.

Figure 9(a-d) shows Case 3: illuminant direction is known, initial
normals incorrect and free to rotate in the plane orthogonal to the
boundary curve. Shading term is .3 (complete correction). The initial
boundary normals are smoothly related, but twisted as they get
farther from the corners (compare 9a to 8a). 9b, c, and d show the
correct normals, computed normals, and the normal errors. The
computed normals are fairly accurate, and again the irradiance error
is negligible, showing that shading corrections can overcome bad
boundary information.

Figure 10(a-d) shows Case 4: illuminant direction unknown,
boundary normals initially correct and fixed. This is close to a
Hough implementation of the photometric stereo dual. It is not
quite, because the initially known normals are not allowed to vote
for illuminant direction, but the inner rings obtained by averaging
are. The illuminant is at (19, 90), initially guessed at (0,0), i.e.
overhead, or in the viewing direction. Shading term is ifcomplete
correction). 10a and b show the computed normals and errors, 10c
shows the Hough accumulator array with votes for sun position, and
10d the reconstructed surface.

Figure 11(a-e) shows Case 5: illuminant direction unknown,
boundary normals initially correct but free to rotate in the plane
orthogonal to the boundary curve. Here again the initially correct
boundary normals helped the Hough process find the correct
illuminant direction immediately. Shading term is ^complete
correction). 11a and b show the computed normals and errors after
21 iterations. 11c and d show two displays of the accumulator array
for illuminant direction, and 11e shows the reconstructed surface.


Figure 12(a-e) shows Case 6: illuminant direction unknown,
boundary normals initially incorrect and free to rotate in the plane
orthogonal to the boundary curve. Here the boundary normals were
twisted by up to 45 degrees as described above. The illuminant was
at (19, 135), and guessed initially to be at (0,0). Shading term is
.2 (complete correction). 12a and b give the computed normals and
differences, showing significant errors. The initial Hough operation
(shown in 12c and d) does not yield a sharp peak for the illuminant
direction (this is no surprise). The derived direction was between the
correct direction and the first guess. The next iteration of illuminant
calculation (12e) in fact found the correct accumulator bin, and after
two more iterations settled down to within .05 radian of the correct
illuminant direction.

9.  Conclusions  and  Future  Work 

The main conclusions we draw are the following.

1. Iterative relaxation schemes for shape-from-shading
calculations are robust over several design choices.

2. The Gaussian sphere (polar space) parameterization of
orientation seems natural, adequate, and elegant.

3. Iterative schemes may suffer from significant anomalies
resulting from discrete approximations to continuous quantities.

4. There are several potentially important ideas for improving
performance of iterative relaxation schemes (see below).

5. It is possible to use Hough methods to derive illuminant
direction in parallel with shape-from-shading calculations only so
long as the normals are fairly well known. Otherwise their voting
behavior is too incoherent or wrong to be overcome by smoothness
constraints, and the derived surface is distorted.

Thanks  to  this  work  and  to  that  of  others,  we  have  several  ideas 
for  improving  the  performance  of  these  algorithms. 

1. Use of local information. The work of Pentland (1982) shows
how information about slant, tilt, and illuminant direction may be
inferred immediately from local measures of intensity differentials.
Especially when the image is not noisy, such information can provide
both a priori information and constraint for parallel-iterative schemes.
In fact quite reasonable shape results may be derived strictly with
local clues, using no global relaxation. Pentland's version of our
derivation of shape and illuminant direction in parallel is his
"bootstrapping" method of iterative successive computation of
illuminant direction and shape.

2. The issue of smoothness constraint is an interesting one. The
one we used (that of Ikeuchi and Horn [1981]) is best satisfied by a
flat surface. In surfaces with substantial difference in principal
curvatures at a point (especially surfaces with zero or negative
Gaussian curvature), it may be useful to try to decouple the
smoothness measure in the two principal directions. One
approximation to this is still to enforce smoothness through
averaging, but only along iso-brightness contours. This accomplishes
the decoupling for cylindrical surfaces, though not exactly for
hyperbolic surfaces.

3. Performance in noisy conditions was investigated by Strat
[1979], and similarly we would like to study performance of the
Hough scheme with various reflectance functions in noisy conditions.
Also of interest is the CHough scheme, which sharpens peaks in
parameter space (Brown 1982b).

4. This work is one of several applications we should like to
implement with a cache-based Hough Transform [Brown and Sher

REFERENCES 

Ballard, D.H. Generalizing the Hough transform to detect arbitrary
shapes. Pattern Recognition 13, 2, 111-122, 1981.

Ballard, D.H. and O.A. Kimball. Rigid body motion from depth and
optical flow. TR 70, Computer Science Department, Univ. of
Rochester, Dec. 1981 (submitted to CGIP special issue on
computer vision).




Barrow, H.G. and J.M. Tenenbaum. Recovering intrinsic scene
characteristics from images. In Computer Vision Systems, A.R.
Hanson and E.M. Riseman (eds), Academic Press, New York,
1978.

Barrow, H.G. and J.M. Tenenbaum. Interpreting line drawings as
three-dimensional surfaces. Artificial Intelligence 17, 75-116,
1981.

Brooks, M.J. Surface normals from closed paths. Proc. Sixth IJCAI,
Tokyo, Japan, 98-101, 1979.

Brown, C.M. Two descriptions and a two-sample test for 3-D vector
data. TR 49, Computer Science Department, Univ. of
Rochester, Feb. 1979.

Brown, C.M. Bias and noise in the Hough transform I: theory. TR
105, Computer Science Department, Univ. of Rochester, May
1982.

Brown, C.M., and M. Curtiss. Bias and noise in the Hough transform
II: experiments. TR 113, Computer Science Department, Univ.
of Rochester, August 1982.

Bruss, A.R. The image irradiance equation: its solution and
application. PhD thesis and MIT AI Memo 623, June 1981.

Cook, R.L., and K.E. Torrance. A reflectance model for computer
graphics. Computer Graphics 15, 3 (Proc. SIGGRAPH 81),
307-315, August 1981.

Duda, R.O. and P.E. Hart. Use of the Hough transformation to
detect lines and curves in pictures. Comm. ACM 15, 1, 11-15,
Jan. 1972.

Horn, B.K.P. Shape from shading: a method for obtaining the shape
of a smooth opaque object from one view. Tech. Report
MAC-TR-79, Project MAC, MIT, 1970.

Horn, B.K.P. Obtaining shape from shading information, in Winston,
P.H. (Ed.), Psychology of Computer Vision, McGraw-Hill, NY,
115-155, 1975.

Horn, B.K.P. Understanding image irradiances. Artificial Intelligence
8, 201-231, 1977.

Horn, B.K.P. and R.W. Sjoberg. Calculating the reflectance map.
Applied Optics 18, 11, 1770-1779.

Huffman, D.A. Surface curvature and applications of the dual
representation. In Computer Vision Systems, A.R. Hanson and
E.M. Riseman (eds), Academic Press, New York, 1978.

Gauss, K.F. General investigations of curved surfaces (1827).
Translated by A. Hiltebeitel and J. Morehead. Raven Press,
Hewlett, N.Y., 1965.

Gilchrist, A.L. The perception of surface blacks and whites.
Scientific American 240, 3, 112-124, March 1979.

Grimson, W.E.L. A computational theory of visual surface
interpolation. MIT AI Memo 613, June 1981.

Ikeuchi, K. Determination of surface orientations of specular surfaces
by using the photometric stereo method. IEEE-PAMI, to
appear.

Ikeuchi, K. and B.K.P. Horn. Numerical shape from shading and
occluding boundaries. Artificial Intelligence 17, 141-184, 1981.

Mackworth, A.K. Interpreting pictures of polyhedral scenes.
Artificial Intelligence 4, 121-137, 1973.

Nicodemus, F.E., J.C. Richmond, J.J. Hsia, I.W. Ginsberg, and T.
Limperis. Geometrical considerations and nomenclature for
reflectance. NBS Monograph 160, U.S. Dept. of Commerce,
NBS, 1977.

Pentland, A.P. Local computation of shape. Technical Memo, SRI
International, preliminary version, 1982b.

Silver, W. Determining shape and reflectance using multiple images.
S.M. thesis, Dept. of EE and CS, MIT, 1980.

Strat, T.M. A numerical method for shape from shading from a
single image. S.M. thesis, Dept. of EE and CS, MIT, 1979.

Whitted, T. An improved illumination model for shaded display.
Comm. ACM 23, 6, 343-349, June 1980.

Woodham, R.J. Reflectance map techniques for analyzing surface
defects in metal castings. PhD thesis and MIT AI Memo 457,
June 1978a.

Woodham, R.J. Photometric stereo, MIT AI Memo 479, June 1978b.


Figure 1: Three light source positions have three associated
reflectance maps. In each situation, an image point (i,j) has a
brightness that can be explained by a locus of orientations (here,
points in p,q space). The intersection of these loci is the orientation
of the surface at the point.


Figure 2: (a) The solid, smooth, potato-like object has a smooth
occluding boundary, at which the normal is uniquely determined. (b)
The potato-chip-like object has a sharp occluding boundary, at which
the normal is constrained to lie in a plane.


Pentland, A.P. Local computation of shape, PhD thesis, MIT, 1982a.
Figure 7(a-g): Case 1: a priori known illuminant, initially correct
and fixed boundary normals (see text).

Figure 8(a-d): Case 2: a priori known illuminant, initially correct
boundary normals free to rotate in plane (see text).

Figure 9(a-d): Case 3: a priori known illuminant, initially incorrect
boundary normals free to rotate in plane (see text).

Figure 10(a-d): Case 4: unknown illuminant, initially correct and
fixed boundary normals (see text).

Figure 11(a-e): Case 5: unknown illuminant, initially correct
boundary normals free to rotate in plane (see text).

Figure 12(a-e): Case 6: unknown illuminant, initially incorrect
boundary normals free to rotate in plane (see text).


APPENDIX 

Invariant Image under Varying Illuminant Direction: The
Unconstrained One-Dimensional Case

For this exercise, assume the brightness at a point is some
function of the incidence angle (a generalization of Lambertian
reflectance). The viewer is at infinity on the +y axis (orthogonal
projection) and a point illuminant is at infinity in the +y halfplane.
There are no boundary conditions: the slope of the 1-D "surface" is
nowhere constrained.

The 1-D "surface" is a function

$$y = f(x).$$

For a given illumination angle, the brightness of the surface at
any position x is determined by its slope there (Fig. A.1).

If the illumination angle were to change by $\delta$, then the same
image would result if all the tangent angles changed by $\delta$ to
compensate, determining a new surface g(x). In the original surface
f(x),

$$\text{tangent angle of } f(x) \text{ at } x = \gamma,$$

and

$$f'(x) = df/dx = \tan\gamma.$$

In the compensating, tilted surface g(x),

$$\text{tangent angle of } g(x) \text{ at } x = \gamma + \delta = \beta,$$

and

$$g'(x) = dg/dx = \tan\beta = \frac{f'(x) + \tan\delta}{1 - f'(x)\tan\delta}.$$

Thus

$$g(x) = \int \frac{f'(x)}{1 - f'(x)\tan\delta}\,dx + \int \frac{\tan\delta}{1 - f'(x)\tan\delta}\,dx.$$

For example, if

$$f(x) = x^2,$$

then

$$g(x) = -\frac{x}{\tan\delta} - \frac{1}{2}\csc^2\delta\,\log(1 - 2x\tan\delta).$$
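
The closed form in this example can be checked numerically. The sketch below differentiates g by central differences and compares the result with the tangent-addition expression for g'(x) when f(x) = x^2. The function names, the sample point, and the step size are illustrative choices, and the check is only valid where 1 - 2x tan(delta) > 0.

```python
import math

def g_prime(x, delta):
    """Slope of the compensating surface for f(x) = x**2, where f'(x) = 2x."""
    t = math.tan(delta)
    return (2.0 * x + t) / (1.0 - 2.0 * x * t)

def g(x, delta):
    """Closed form of g(x) from the example (integration constant dropped)."""
    t = math.tan(delta)
    csc2 = 1.0 / math.sin(delta) ** 2
    return -x / t - 0.5 * csc2 * math.log(1.0 - 2.0 * x * t)

def check(x=0.3, delta=math.radians(10.0), h=1e-6):
    """Return (numerical derivative of g, analytic g') at x for comparison."""
    numeric = (g(x + h, delta) - g(x - h, delta)) / (2.0 * h)
    return numeric, g_prime(x, delta)
```

The two values returned by `check()` should agree to within the accuracy of the finite difference, and `g_prime` should equal the tangent of the original tangent angle plus delta.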
Abstract

Given  a  line  drawing  from  an  image  with  shadow  regions  identified, 
the  shapes  of  the  shadows  can  be  used  to  generate  constraints  on 
the  orientations  of  the  surfaces  involved.  This  paper  describes  the 
theory  which  governs  those  constraints  under  orthography. 


The correspondence problem has been explored primarily by
Lowe and Binford [7]. They describe several properties of this
correspondence, and include descriptions of the special points of
view from which degenerate cases arise. O'Gorman [11] proposed
a heuristic method for finding correspondences in the blocks world
under orthography.


A "Basic Shadow Problem" is first posed, in which there is a single
light source, and a single surface casts a shadow on another
(background) surface. There are six parameters to determine: the
orientation (2 parameters) for each surface, and the direction of the
vector (2 parameters) pointing at the light source. If some set of 3
of these are given in advance, the remaining 3 can then be
determined geometrically. The solution method consists of
identifying "illumination surfaces" consisting of illumination
vectors, assigning Huffman-Clowes line labels to their edges, and
applying the corresponding constraints in gradient space.


The analysis is extended to shadows cast by polyhedra and curved
surfaces. In both cases, the constraints provided by shadows can
be analyzed in a manner analogous to the Basic Shadow Problem.
When the shadow falls upon a polyhedron or curved surface,
similar techniques apply. The consequences of varying the
position and number of light sources are also discussed. Finally,
some methods are presented for combining shadow geometry with
other gradient space techniques for 3D shape inference.

1. Introduction


1.1  The  Shadow  Geometry  Problem 

When  shadows  are  present  in  an  image,  they  provide  information 
which  can  be  used  for  determining  the  3D  shapes  and  orientations 
of  the  objects  in  the  scene.  The  interpretation  of  shadows  in  an 
image  involves  three  distinct  processes: 

•  Finding  shadow  regions  in  the  image 


Geometric interpretation of shadows is also performed by Lowe
and Binford [7], who use shadows to determine height in overhead
views of airplanes. They measure the distance in the image
between the outline of an object and the outline of its shadow, and
use similar triangles to conclude that this distance is proportional to
the height of the object's edge above the ground. These
techniques have been employed in manual photo-interpretation of
aerial photographs as well [14].

Waltz [15] used shadows to classify surfaces into several
orientation categories depending upon the geometry of the
shadows in a line drawing. His categories were qualitative, such as
"front left" for an approximately vertical surface tipped to the left.

This paper presents a theory describing the constraints that
shadows provide between surface orientations in line drawings,
using shadow and surface outlines under orthographic projection.
This can be thought of as a method for achieving the same kind of
results as Waltz, but computing exact surface orientations rather
than simply categorizing the surfaces into classes with similar
orientations. The theory presented here subsumes the
"shadow-plane" idea suggested by Mackworth [8] as a means for
generating gradient-space constraints from shadows.

Shadows cast by and upon curved surfaces have been described
by Witkin [16], who derived equations relating surface curvature to
curvature of shadow edges in the image. The presentation in this
paper is somewhat different, discussing surface gradient (local
orientation) rather than curvature (rate of change of orientation).


•  Solving  the  correspondence  problem  to  determine 
which  object  has  cast  each  shadow  region 

•  Geometrically  deducing  information  about  the  objects 
and  surfaces  involved  on  the  basis  of  the  identified 
object/shadow  pairs 

Techniques for the first task, finding shadow regions, have been
proposed by many researchers, usually by looking for regions of
low intensity with approximately the same hue as some neighboring
region [10, 12]. Lowe and Binford [7] and Witkin [17] have
proposed criteria which should be satisfied by edges of shadow
regions: these can be used to suggest or try to confirm the
hypothesis that a particular region is a shadow. Waltz [15]
developed a method for labeling lines in line drawings as shadow
edges, based on local geometric criteria at vertices.


1.2 Gradient Space and Line Labeling

This  section  presents  an  introduction  to  the  gradient  space  and 
line  labeling  for  readers  who  are  not  already  familiar  with  these 
topics. 

The  coordinate  system  used  in  this  paper  is  illustrated  in  figure  1. 
The  x  and  y  axes  are  aligned  on  the  image  plane  in  the  horizontal 
and  vertical  directions,  respectively:  the  z  axis  points  towards  the 
viewer  (or  camera). 

In this paper, it will be presumed that the point (x, y, z) in the scene
corresponds to the point (x, y) in the image: this is orthography.
Perspective projection is not discussed in detail in this paper.

Surface  orientations  can  be  represented  by  points  in  a  plane 
called  the  gradient  space  (figure  2)  [4].  If  a  surface  is  represented 
Figure 1: The X-Y-Z Coordinate System


by the equation

$$-z = f(x, y)$$

then its gradient is represented by the point:

$$(p, q) = (\partial f/\partial x,\ \partial f/\partial y)$$

This assigns a natural interpretation to points in gradient space: a
surface which is "tipped" to the right is represented by a point on
the right side of the origin; a surface tipped left has a gradient to the
left of the origin. Similarly, a surface which is tipped up (or down)
has its gradient above (or below) the origin. In figure 2, the
gradients $G_A$ (etc.) are shown for the surfaces $S_A$ (etc.) in the
line drawing at the right.


Figure 2: The Gradient Space


The convex and concave labels indicate relationships between the
gradients of the surfaces which meet along an edge [8]. When two
surfaces are joined along a convex edge, their gradients lie along a
line in gradient space which is perpendicular to the edge in the
image (figure 4). Furthermore, the relative positions of the surface
gradients will be the same as the relative positions of the surfaces
in the image. When two surfaces meet at a concave line, the
gradients are still on a perpendicular line in gradient space, but the
relative positions are reversed.


Figure  4:  Line  Labels  and  Gradient  Space  Relationships 

In general, if an edge $E = (\Delta x, \Delta y)$ is contained on a
surface with gradient $G = (p, q)$, then the edge corresponds to the
three-dimensional vector $(\Delta x, \Delta y, \Delta z)$ where

$$-\Delta z = G \cdot E \qquad (1.1)$$
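
Equation (1.1) also accounts for the perpendicularity rule stated above: two surfaces that meet along an edge must assign that edge the same depth change, so the difference of their gradients has zero dot product with the edge. A minimal sketch follows, with the hypothetical helper names `delta_z` and `share_edge`, assuming gradients and image edges are given as 2-tuples.

```python
def delta_z(gradient, edge):
    """Depth change along an image edge lying on a plane, from -dz = G . E."""
    p, q = gradient
    dx, dy = edge
    return -(p * dx + q * dy)

def share_edge(g1, g2, edge, tol=1e-9):
    """True if two planes can meet along `edge`.

    Both planes must give the edge the same depth change, which is the
    same as saying (G1 - G2) is perpendicular to the edge.
    """
    return abs(delta_z(g1, edge) - delta_z(g2, edge)) <= tol
```

For instance, planes with gradients (0, 0) and (0, 3) can meet along a horizontal edge (1, 0), since their gradients differ only in the vertical direction.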

In this paper, a method is proposed for assigning Huffman-Clowes
line labels to shadow-making edges and shadow edges in a
line drawing, and for using the resulting gradient space
relationships to determine surface orientations.


2.  The  Basic  Shadow  Problem 

The  Basic  Shadow  Problem  is: 


Before computing surface orientations, it is common to attempt
to produce a line drawing from an image, in which all the surfaces
are outlined. Huffman and Clowes [4, 2] showed that the edges
(line segments) in a line drawing do not all represent the same
three-dimensional surface configuration. The four types of edges
they discovered are shown in figure 3, along with the half-planes
containing the surfaces which meet at each type of edge. At a
convex edge, the surfaces recede from the viewer as you travel
farther from the edge. At a concave edge, the surfaces approach
the viewer as you travel farther from the edge. At an occluding
edge, only one of the two surfaces involved is directly visible in the
image. Waltz [15] developed an algorithm for assigning these
labels to the edges in a line drawing.
Figure 3: Line Labels and Surface Intersections


Given a line drawing such as Figure 5, what
constraints exist between the occluding surface $S_O$ and
the shaded surface $S_S$?

For simplicity, we will begin by assuming that the surfaces are both
flat, and that orthographic projection is used. We will also, for the
time being, presume that the light source is infinitely far away: this
means that all illumination vectors (light rays emanating from the
light source) are parallel.


2.1  Solution  of  the  Problem 

To show the proper correspondences, the edges and vertices
can be labeled as in figure 6, where edge $E_{S1}$ is the shadow edge
corresponding to $E_{O1}$, $E_{S2}$ is the shadow of $E_{O2}$, and
vertex $V_{S12}$ is the shadow of $V_{O12}$.

Consider the physical interpretation of edge $E_{S1}$. Some light rays
just graze past $S_O$ at $E_{O1}$, and continue on to strike $S_S$
along $E_{S1}$. This set of rays forms a surface (a piece of a plane),
in fact the plane containing $E_{O1}$ and $E_{S1}$. This is a surface
consisting of "illumination vectors"; call it surface $S_{I1}$
(Figure 7).


Figure 6: Basic Shadow Problem - Correspondences Labeled


Figure  7:  Basic  Shadow  Problem  -  Illumination  Surface  1 

Suppose we were to cut a piece of cardboard and fit it into the
space occupied by $S_{I1}$. Then, this cardboard and $S_O$ would be
joined along $E_{O1}$, a convex edge. Using Huffman-Clowes line
labeling [4], this edge can be given the label +. Similarly, $E_{S1}$
joins $S_S$ and $S_{I1}$, and is concave; it receives the label -.

As Mackworth showed [8], these line labels can be mapped into
constraints in the gradient space. The gradient of $S_O$ ($G_O$) and
the gradient of $S_{I1}$ ($G_{I1}$) must be joined by a line
perpendicular to $E_{O1}$; since the label of $E_{O1}$ is +, $G_O$ and
$G_{I1}$ have the same relative positions as $S_O$ and $S_{I1}$.
Similarly, $G_{I1}$ and $G_S$ are joined by a line perpendicular to
$E_{S1}$, with relative positions reversed because of the - label.
These facts yield the relationship shown in figure 8 in the gradient
space. However, we do not yet know the position of this figure in
gradient space, nor the distances involved; only the angles are known.


Figure  8:  Gradient  Space  Constraints  from  Illumination  Surface  1 

$S_{I1}$ is not the only illumination surface in the Basic Shadow
Problem: the illumination surface $S_{I2}$ joins edges $E_{O2}$ and
$E_{S2}$ (Figure 9). Along $E_{S2}$, the - label is assigned; along
$E_{O2}$, the - label refers to the junction of $S_O$ and the upper
half-plane of $S_{I2}$. The gradient space constraints are shown in
figure 10. Note that it is possible for $E_{O2}$ and $E_{S2}$ to be
parallel, in which case the two rays shown in gradient space are
coincident.

A third constraint in the gradient space arises from the fact that
an edge $E_I$ can be drawn joining $V_{O12}$ and $V_{S12}$ (Figure 11).
This edge lies in a line which passes through the light source, since
$V_{S12}$ is the shadow of $V_{O12}$. The vector $I$ pointing at the
light source can be represented in gradient space by a point $G_I$,
which is the gradient of those surfaces whose normal vectors are
parallel to $I$. Since $E_I$ lies in the projection of this vector onto
the image plane, the point $G_I$ must lie along a line in gradient
space, passing through the origin, and parallel to $E_I$ (Figure 12).
It is not known, however, how far this point $G_I$ is from the origin;
suppose this is determined somehow (as described below), and call the
distance k. It should be noted that k represents the relative change
in z with a change in x or y along the illumination vector. It is
defined by this equation:

$$k = \sqrt{\Delta x^2 + \Delta y^2} \,/\, \Delta z = \|E_I\| \,/\, \Delta z \qquad (2.1)$$


Figure 9: Basic Shadow Problem - Illumination Surface 2

Figure 10: Gradient Space Constraints From Illum. Sur


Figure 11: Basic Shadow Problem - Illumination Vector


Figure 12: Gradient Space Constraints From Illumination Vector

The line $L_{illum}$, perpendicular to $E_I$ and located at a distance
$1/k$ from the origin, represents the locus of the gradients of all
planes which contain the illumination vector $I$. This is the set of
all illumination planes, and in particular contains both $S_{I1}$ and
$S_{I2}$; thus, $G_{I1}$ and $G_{I2}$ are points on the line
$L_{illum}$. This property subsumes the property of $G_{I1}$ and
$G_{I2}$ that they must be joined by a line perpendicular to $E_I$,
since $E_I$ can be given the label + or - (depending on which
half-planes the line label refers to).
The line $L_{illum}$ is the same as the terminator described by Horn in
[3]. It separates the gradient space into two half planes: the half
plane containing $G_I$ represents the gradients of all planes that will
receive illumination, while the other half plane contains the
gradients of self-shadowed surfaces (facing away from the light
source).
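
The relationships among $G_I$, k, and $L_{illum}$ can be made concrete with a short sketch. It relies on sign conventions that are assumptions here: the surface is written as -z = f(x, y), its viewer-facing normal is taken as (p, q, 1), and the illumination vector I = (dx, dy, dz) points at the light source with dz > 0. Under those conventions a plane contains I exactly when (p, q, 1) . I = 0, which is a line perpendicular to the image projection of I at distance 1/k from the origin. The function names are hypothetical.

```python
import math

def light_gradient(light):
    """G_I = (dx/dz, dy/dz): gradient of planes whose normal is parallel to I."""
    dx, dy, dz = light
    return (dx / dz, dy / dz)

def k_value(light):
    """Equation (2.1): k = sqrt(dx^2 + dy^2) / dz, the distance of G_I from the origin."""
    dx, dy, dz = light
    return math.hypot(dx, dy) / dz

def illumination_margin(gradient, light):
    """(p, q, 1) . I: zero on L_illum, positive for planes that receive light."""
    p, q = gradient
    dx, dy, dz = light
    return p * dx + q * dy + dz
```

The sign of `illumination_margin` then classifies a gradient as illuminated (positive, the side containing $G_I$), on the terminator (zero), or self-shadowed (negative).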

This  is  the  extent  of  the  information  available  from  the  line 
drawing  in  figure  5.  Since  each  gradient  is  an  ordered  pair  (p,  q), 
the  problem  has  six  parameters  to  be  computed: 


must be $G_{I1}$. Through $G_{I1}$, draw a line perpendicular to
$E_{O1}$. $G_O$ must lie on this line.

3. From $G_S$, draw a line perpendicular to $E_{S2}$. Where it
intersects $L_{illum}$ will be $G_{I2}$. From there, draw a line
perpendicular to $E_{O2}$. Since $G_O$ must lie on this line,
the intersection of this line with the final line from step
(2) above must be $G_O$.
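
The gradient-space construction for $G_O$ from a given k and $G_S$ (steps 1 to 3) amounts to a few 2-D line intersections, sketched below. This is an illustration rather than the paper's program: the function names are hypothetical, edges are given as image direction vectors, and $L_{illum}$ is placed on the side of the origin opposite $G_I$, which is an assumed sign convention (surface written as -z = f(x, y), light vector with positive z component).

```python
import numpy as np

def perp(v):
    """Rotate a 2-D vector by 90 degrees."""
    return np.array([-v[1], v[0]], dtype=float)

def intersect(p0, d0, p1, d1):
    """Intersection of the lines p0 + s*d0 and p1 + t*d1 (assumed not parallel)."""
    s, _ = np.linalg.solve(np.column_stack([d0, -d1]), p1 - p0)
    return p0 + s * d0

def solve_basic_shadow(g_s, k, e_i, e_s1, e_o1, e_s2, e_o2):
    """Recover G_O (and G_I1, G_I2) from G_S, k, and image edge directions."""
    g_s = np.asarray(g_s, dtype=float)
    unit = np.asarray(e_i, dtype=float) / np.linalg.norm(e_i)
    # Step 1: L_illum is perpendicular to E_I at distance 1/k from the origin
    # (assumed to lie on the side opposite G_I = k * unit).
    l_point, l_dir = -unit / k, perp(unit)
    # Step 2: G_I1 is where the perpendicular to E_S1 through G_S meets L_illum.
    g_i1 = intersect(g_s, perp(e_s1), l_point, l_dir)
    # Step 3: G_I2 is found the same way from E_S2.
    g_i2 = intersect(g_s, perp(e_s2), l_point, l_dir)
    # G_O lies on the perpendicular to E_O1 through G_I1 and on the
    # perpendicular to E_O2 through G_I2; intersect the two lines.
    g_o = intersect(g_i1, perp(e_o1), g_i2, perp(e_o2))
    return g_o, g_i1, g_i2
```

The sketch fails with a singular-matrix error in the degenerate cases where two of the lines are parallel, for example when $E_{O1}$ and $E_{O2}$ have the same direction.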

In [13], the solution for the Basic Shadow Problem is shown to be
of  the  form: 


•  (2  parameters)  GQ,  the  gradient  of  SQ 

•  (2  parameters)  Gg,  the  gradient  of  Sg 

•  (2  parameters)  G(,  the  direction  of  the  light  source. 

From  the  Basic  Shadow  Problem  geometry,  three  constraints  are 
provided: 

«  The  angle  G0-G„Gg,  which  comes  from  the  angle 
COl'£S1 

•  The  angle  GQG|2GS.  which  comes  from  the  angle 
between  tQ2  and 

•  The  direction  of  the  line  G|kjm  (containing  Gn  and  Ge), 
which  comes  from  the  direction  of  E,,. 

We  would  therefore  expect  that  three  parameters  must  be  given  in 
advance,  and  the  other  three  can  be  computed  from  the  geometry. 

Let  us  suppose,  for  example,  that  the  value  k  is  given  (the  relative 
depth  component  of  the  direction  of  the  light  source),  and  that  Gg  is 
Known  (the  relative  orientation  of  the  background  with  respect  to 
the  camera)  The  construction  in  the  gradient  space  for  computing 
G0  proceeds  as  follows  (Figure  13). 

cEs2  iEoi 


Figure  13:  Solution  to  Basic  Shadow  Problem 

1 .  Draw  the  line  parallel  to  E„  through  the  origin.  Since  k 
is  known,  G.  and  LJ(ura  can  be  found. 

2.  Plot  Gs.  which  was  given.  Through  this  point,  draw  a 
line  perpendicular  to  £gv  Where  it  intersects  LAum 
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where  O,  R,  S,  T,  U,  and  V  can  be  computed  from  measurements  of 
the  line  drawing  (or  image). 
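
The three-step gradient-space construction can also be carried out
numerically. The following is a minimal sketch, not the closed-form
solution of [13]. It assumes the convention -Δz = G · E for an edge E
lying in a plane of gradient G, so that two planes meeting along E have
gradients joined by a line perpendicular to E, and it assumes k > 0.
The function names and the argument order are hypothetical; the edge
arguments are image-plane direction vectors measured from the line
drawing.

```python
import numpy as np

def line_perp_to(edge, through):
    """Gradient-space line {G : G . edge = through . edge}.

    It passes through the point `through` and is perpendicular to `edge`.
    Returned as (normal, offset).
    """
    edge = np.asarray(edge, dtype=float)
    return edge, float(np.dot(edge, through))

def intersect(line_a, line_b):
    """Intersection point of two lines given as (normal, offset)."""
    (na, ca), (nb, cb) = line_a, line_b
    return np.linalg.solve(np.array([na, nb]), np.array([ca, cb]))

def solve_basic_shadow(e_i, k, g_s, e_s1, e_o1, e_s2, e_o2):
    """Compute G_I, G_I1, G_I2 and G_O from k, G_S and the image edges."""
    e_i = np.asarray(e_i, dtype=float)
    e_hat = e_i / np.linalg.norm(e_i)
    # Step 1: G_I lies on the line through the origin parallel to E_I;
    # L_illum is perpendicular to E_I at distance 1/k from the origin
    # (assumed sign convention: G . e_hat = -1/k on L_illum).
    g_i = k * e_hat
    l_illum = (e_hat, -1.0 / k)
    # Step 2: perpendicular to E_S1 through G_S meets L_illum at G_I1;
    # G_O lies on the perpendicular to E_O1 through G_I1.
    g_i1 = intersect(line_perp_to(e_s1, g_s), l_illum)
    locus_1 = line_perp_to(e_o1, g_i1)
    # Step 3: the same with E_S2 and E_O2 gives G_I2 and a second locus.
    g_i2 = intersect(line_perp_to(e_s2, g_s), l_illum)
    locus_2 = line_perp_to(e_o2, g_i2)
    return {"G_I": g_i, "G_I1": g_i1, "G_I2": g_i2,
            "G_O": intersect(locus_1, locus_2)}
```

The sketch fails with a singular-matrix error when two of the lines
are parallel, which corresponds to a degenerate line drawing.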


2.2  Relationships  Among  the  Parameters  Supplied  in 
Advance 

In the example above, G_S and k were needed before the
construction could take place. In practice, a program for a specific
application may not be able to compute these particular
parameters.


It is possible to begin the construction with any three of the six
pieces of information specified in advance, as long as none are
redundant with each other, and none are redundant with the
direction of E_I.


It is possible, or perhaps likely, that a given line drawing will
include the edge E_OS between S_O and S_S, as in figure 14. An
interesting question arises as to whether this provides some
additional constraint, which might perhaps relax the requirement
that three pieces of information be provided in advance.


Figure 14: Basic Shadow Problem -- Edge E_OS Provided


The edge E_OS turns out to be redundant with E_O2 and E_S2, in the
sense that given the latter, the former can be constructed, and vice
versa. Suppose we are given E_O2 and E_S2. These represent the
intersections (in the scene) of planes S_O and S_I2, and S_S and S_I2,
respectively. Now, either these two lines intersect or they do not.
Suppose they intersect in a point. Call it V_OS2, since it is contained
in surfaces S_O, S_S, and S_I2. This point is contained in both S_O and
S_S, as is point V_OS1, which is given in the line drawing. Therefore,
the line E_OS must pass through these points. On the line drawing,
find the intersection of E_O2 and E_S2. Draw the line joining this point
to V_OS1: this is E_OS (Figure 15).

Now, suppose that the two lines E_O2 and E_S2 do not intersect
anywhere. Then there is no point V_OS2 contained in all three
surfaces S_O, S_S, and S_I2. So, E_OS cannot intersect either E_O2 or
E_S2. Since it is coplanar with these (on surfaces S_O and S_S,
respectively), it must be parallel to both. Edge E_OS can therefore be
drawn through V_OS1, parallel to E_O2 (and E_S2).

By this reasoning, E_OS can be constructed from E_O2 and E_S2.
Similarly, if E_OS is given, either of E_O2 and E_S2 can be calculated
from the other, to provide the geometric constraint described above
for the solution of the Basic Shadow Problem. Of course, the
solution can also proceed directly using the label - on E_OS, with
identical results.

The solution of this problem should be compared with the
solution to the problem if there are no shadows: if just S_O is given,
joined to S_S along edge E_OS. Here, there are four parameters (G_O
and G_S) to compute, and one constraint from the image (E_OS), so
three pieces of information are still needed in advance. With
shadows, the same number of a priori parameters are needed, but
one of them can be a description of the light source position
instead of a description of a surface orientation. The significance
of shadows is that they allow information about the light source to
be used to solve the problem as a substitute for information about
the surface orientations themselves.

It has not been assumed in this discussion that surfaces S_O and
S_S must touch. In practice, the Basic Shadow Problem arises any
time there are two surfaces which provide two shadow edge pairs
and an enclosed illumination vector (Figure 16). Any additional
shadow edge pairs on these two surfaces will be redundant, as will
any visible edges along which these two surfaces intersect directly.


Figure 16: Occurrence of the Basic Shadow Problem

2.3  Varying  the  Location  of  the  Light  Source 

When the light source is in front of the camera (i.e., in the scene,
where it might even appear in the image) and infinitely far away, the
Basic Shadow Problem takes the form shown in figure 17. In this
case, the first illumination surface S_I1 joins edges E_O1 and E_S1,
giving both of these edges - labels. Illumination surface S_I2 joins
E_O2 and E_S2. At E_S2, the label is clearly -. To label E_O2, it is
necessary to extend S_I2 above this edge, and apply the label to S_O
and the upper half plane of S_I2. The label will then be +.

The vector pointing toward the light source does not intersect the
plane z = 1, but the vector pointing away from the light source
(toward the camera) does. This has the effect of placing the point
G_I in the gradient space on a line parallel to edge E_I passing
through the origin as before, but on the half line towards surface S_S
instead of towards surface S_O. Also, while the redundancy of edge
E_OS is the same as in the Basic Shadow Problem, edges E_O3
and E_S3 are redundant with edges E_O2 and E_S2. This can be easily
seen, since edge E_OS can be calculated from the intersection of E_O1
and E_S1 and the intersection of E_O3 and E_S3; since edge E_OS is
known to be redundant with E_O2 and E_S2, so must be E_O3 and E_S3.

Figure 17: Light Source In Front of Camera, Infinitely Far Away

If the light source is behind the camera but below it, and infinitely
far away, then the geometry is as shown in figure 18. In this case,
the only difference from the Basic Shadow Problem is that edge E_O2
receives the label + instead of -; the labels of edges E_O1, E_S1, E_S2,
and E_OS (if present) will be the same as previously described.


Figure 18: Light Source Behind and Below Camera, Infinitely Far Away


If the light source is a point not infinitely far away, then all
illumination vectors will converge at the light source instead of
being parallel (Figure 19). Only two of the preceding arguments
need to be changed in this case. The first difference is that the
value k is dependent upon the particular illumination vector used,
and each illumination vector will have its own value of k and its own
line of illumination surface gradients L_illum.

Figure 19: Point Light Source at Finite Distance


The second change is that edges E_O3 and E_S3 are no longer
interchangeable with E_OS or with E_O2 and E_S2. The new information
is actually provided not by the angle between the edges E_O3 and
E_S3, but by the new illumination vector E_I2 seen between vertices
V_O23 and V_S23. This is shown in figure 19 for one case (light source
below and behind camera); similar line labels and reasoning hold
for the other cases presented previously.

In this arrangement, the exact position of the light source can be
calculated. The lines E_I1 and E_I2 must intersect (in the scene); the
light source is located at the point of intersection. Under
orthography, as we are assuming here, the x and y coordinates of
the light source will be the same as the x-y coordinates of the
intersection of the lines in the image. So, these coordinates can
easily be found. The relative z coordinate is then found using the k
value for either of these vectors (E_I1 or E_I2), using the definition of k
presented above in equation (2.1): if (Δx, Δy, Δz) is an illumination
vector from an object vertex to the light source (such as E_I1 or E_I2),
then Δx and Δy can be measured in the image, and

$$\Delta z = \sqrt{\Delta x^2 + \Delta y^2} / k$$
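
As a hedged illustration of this calculation, the sketch below
intersects two illumination vectors in the image and then applies
equation (2.1) to the first of them. The function name and its
arguments are hypothetical: v1 and v2 are the image positions of the
two object vertices, d1 and d2 are the image directions of the
illumination vectors leaving them, and k1 is the (assumed positive)
value of k for the first vector. The returned depth is relative to the
first vertex.

```python
import numpy as np

def light_source_position(v1, d1, v2, d2, k1):
    """Locate a point light source from two illumination vectors.

    Returns None when the two vectors are parallel in the image, which
    indicates a source infinitely far away. Otherwise returns (xy, dz):
    the image coordinates of the source and its depth offset from v1.
    """
    v1, d1, v2, d2 = (np.asarray(a, dtype=float) for a in (v1, d1, v2, d2))
    system = np.column_stack([d1, -d2])
    if abs(np.linalg.det(system)) < 1e-12:
        return None
    # Solve v1 + s*d1 = v2 + t*d2 for the line parameters s and t.
    s, _ = np.linalg.solve(system, v2 - v1)
    xy = v1 + s * d1
    # Equation (2.1) rearranged: dz = sqrt(dx^2 + dy^2) / k.
    dz = np.linalg.norm(xy - v1) / k1
    return xy, dz
```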

It can be determined from the line drawing whether the light
source is in fact infinitely far away: if two illumination vectors (such
as E_I1 and E_I2) intersect, then the light source is at a finite distance,
and all illumination vectors in the image must intersect at the same
point. If any two illumination vectors are parallel, then all
illumination vectors are parallel and the light source is infinitely far
away. These observations can be used to arrive at constraints
between various simple shadow problems that arise in different
parts of the same image, involving different objects and surfaces.


2.4  Changing  the  Number  of  Light  Sources 

It is possible that several light sources will be present, as in figure
20. In this case, each light source produces two parameters in the
problem (the direction of illumination), and adds two image
constraints (an illumination vector and one non-redundant shadow
edge pair). The number of a priori parameters needed will be the
same, regardless of how many light sources are present.


Figure 20: Basic Shadow Problem With Multiple Light Sources
(* edges are redundant with E_OS)

However,  for  each  light  source,  one  of  the  a  priori  parameters 
may  be  the  value  k  for  that  light  source,  based  on  knowledge  of  the 
three-dimensional  direction  of  illumination.  In  general,  if  n  light 
sources  are  present  and  the  value  of  k  is  known  for  each,  the 
problem  has  2n  +  4  parameters,  the  image  provides  3n  + 1 
constraints,  and  3-n  parameters  are  needed  in  advance.  Thus, 
shadows  allow  you  to  use  a  priori  knowledge  about  light  source 
positions  instead  of  a  priori  knowledge  about  surface  orientations 
when  computing  the  gradients  of  the  visible  surfaces. 


In figure 21, there are no light sources or shadows. There are 4
parameters to compute (the gradients of the two surfaces). An
image constraint will be provided in this case only if the two
surfaces S_O and S_S touch along edge E_OS; if they do not, then an
extra a priori parameter will be needed (i.e. 4 instead of 3).


Figure 21: Two Surfaces With No Light Source


3.  Polyhedral  Shadow  Geometry 


3.1  Shadows  Falling  On  Polyhedra 
The shadow of S_O may fall on two surfaces, S_S and S_T,
connected by an edge E_ST (Figure 22). In this case, the first
illumination surface S_I1 contains edges E_O1, E_S1, and E_T1.
Illumination surface S_I2 contains edges E_O2 and E_S2. Edge E_I is an
illumination vector, joining vertices V_O12 and V_S12.


Figure  22:  Shadow  Falling  On  Two  Surfaces 


In this figure, a Basic Shadow Problem can be solved using
surfaces S_O and S_S. In addition, there are two new parameters to
compute (G_T), and two new constraints (from edges E_ST and E_T1).
Those new constraints can be used to compute G_T once the Basic
Shadow Problem has been solved, using these relations:

$$E_{ST} \in S_T, S_S: \quad -\Delta z_{ST} = G_T \cdot E_{ST} = G_S \cdot E_{ST}$$

$$E_{T1} \in S_T, S_{I1}: \quad -\Delta z_{T1} = G_T \cdot E_{T1} = G_{I1} \cdot E_{T1}$$

The edge E_ST is labeled - if the shadow edge E_T1 bends toward
E_O1 from E_S1, and + if it bends away from E_O1. The complete
solution of this problem, like the Basic Shadow Problem, requires
that three pieces of information be supplied in advance.


Figure  23:  Shadow  Falling  On  Many  Surfaces 

This solution technique can be generalized to cases such as
figure 23, in which there are several shaded surfaces. If there are n
shaded surfaces which intersect the shadow edge with no
discontinuities in the shadow edge, the problem will have a total of
2n + 4 parameters: 2n for the gradients of the shaded surfaces, 2 for
G_O, and 2 for G_I. The image will supply 2n + 1 constraints; three
parameters must be given in advance.

It is possible for the shadow edge to exhibit discontinuities when
the shadow edge falls across occluding edges, as in figure 24.

Figure 24: Shadow Edge With Discontinuities

The solution method is exactly as before, but this time there will
be no constraint between surfaces S_S and S_T, since edge E_ST has
been replaced by edge E_TX, which provides no constraint between
S_S and S_T. Therefore, the image provides one less constraint, and
one additional non-redundant parameter must be supplied in
advance in order to compute all the surface orientations. Of
course, the gradient of surface S_X cannot be computed, since S_X is
not visible in this image.

It is also the case that the edge between the occluding
surface and one shaded surface is non-redundant if there are any
discontinuities along the shadow edge caused by illumination
surface S_I1 (as in figure 24). Therefore, if this edge is present, the
image provides an additional constraint.

3.2  Shadows  Cast  by  Simple  Polyhedra 

When a shadow is cast by a polyhedron as in figure 25, each
shadow-making edge (E_PX, E_OP) must be the intersection of an
illuminated surface and a self-shadowed surface of the polyhedron.
In the figure, S_O is illuminated and S_P is self-shadowed. The edge
E_OP between them is a shadow-making edge, and corresponds to
shadow edge E_S1. Illumination surface S_I1 contains these two
edges. Similarly, it can be concluded that edge E_PX is a
shadow-making edge, and must correspond to shadow edge E_S2 (via
illumination surface S_I2).




Figure  25:  Shadow  Cast  By  Simple  Polyhedron 

It can be deduced from the above observations that whatever
surface intersects S_P along edge E_PX must be illuminated. It
cannot, however, be concluded that the surface containing edge
E_PX also contains edge E_OX. For this reason, no strong statements
can be made about the surfaces that are not visible in the image.

In the figure, a Basic Shadow Problem exists involving surfaces
S_P and S_S. The edge E_PS is therefore redundant with the two
shadow edge pairs (E_OP and E_S1, and E_PX and E_S2). This is
important, since it is typically difficult to resolve details such as
edge E_PS within shaded portions of the image [7].

When the basic problem has been solved, the gradients of
surfaces S_P and S_S will be known. The gradient of S_O can then be
calculated by using the constraints provided by edges E_OP (with
surface S_P) and E_OS (with surface S_S).


Little useful information is provided by edge E_OX, since it borders
on only one visible or constructible surface (S_O). Edge E_PX, on the
other hand, is very important, since it borders on two surfaces
(visible surface S_P and the illumination surface S_I2).

In this problem, there are eight parameters to be computed (the
gradients of surfaces S_O, S_P, and S_S, and the direction of the light
source G_I). The image provides five constraints (two from the
shadow edge pairs E_OP-E_S1 and E_PX-E_S2, one from the illumination
edge E_I, and two from the edges E_OP and E_OS). Therefore, three
parameters must be provided in advance in order to perform the
computation.

If the figure were drawn with no shadows, there would be six
parameters altogether (the gradients of the three surfaces), and
three constraints in the image (from edges E_OP, E_OS, and E_PS).
Three parameters would be required in this case, also. As in the
Basic Shadow Problem itself, the shadow of a polyhedron does not
provide additional constraints: it merely allows you to substitute
information about the light source for a priori information about the
surface orientations themselves, and allows you to utilize
easy-to-find shadow edges instead of hard-to-find details within
shaded areas of the image.

The  above  method  of  solution  also  applies  when  the  light  source 
is  in  a  different  position  as  in  figure  26,  which  illustrates  two 
illuminated  surfaces  of  a  polyhedron. 




Figure  26:  Light  Source  In  a  Different  Position 

3.3  Shadows  Cast  by  Complex  Polyhedra 
Suppose we add an additional self-shadowed surface to figure
25, as in figure 27. In this figure, both S_A and S_P are self-shadowed.
We will suppose that the new surface S_A adjoins a shadow-making
edge E_AO. (If the new surface S_A does not adjoin a shadow-making
edge, it will be buried in the middle of the shaded area and will have
no effect on the shape of the shadow.)


Figure 27: Polyhedron With Two Self-Shadowed Surfaces

Two new parameters are present in the system: the gradient G_A
of the new surface S_A. The image provides two new constraints
that can be used to solve for these two parameters: the shadow
edge pair E_AX-E_S3, and the edge E_AO between surfaces S_A and S_O.
So, three parameters are still required in advance to solve the
system completely.

The edge E_AS is redundant with the shadow edge pair E_AX-E_S3
when shadows are present. One of the two edges E_AP and E_PS is
needed, along with E_OP, to determine the gradient of surface S_P.
Thus, two of the edges E_AP, E_AS, and E_PS are redundant, and only
one is needed. Since these edges all lie in the shadowed area of
the image, they will be difficult to extract reliably [11]. Shadows
reduce the need to find edges within shadowed areas of the image.

When the basic figure (Figure 25) is modified by adding an
illuminated surface instead of a self-shadowed surface, a line
drawing such as figure 28 is the result. In this figure, surfaces S_A
and S_O are illuminated, while S_P is self-shadowed. (Again, if the
surface does not adjoin a shadow-making edge, there will be no
effect on the shape of the shadow and the consequent inferences
to be made from shadow geometry. Therefore, we will assume that
the new surface S_A does adjoin a shadow-making edge E_AP.)


Figure 28: Polyhedron With Two Illuminated Surfaces

The reasoning here is analogous to the case of an additional
self-shadowed surface: two new parameters are needed (G_A), and
there are two new constraints with shadows (the pair E_PX-E_S3 and
the edge E_AO), and two new constraints with no shadows (edges
E_AO and E_AP). In any case, three parameters will be required in
advance.

It is possible that additional a priori parameters will be needed in
pathological cases. Figure 29 depicts an object with a surface
adjoining the shadow-making edge which is not visible in the image
(at E_PX). Here, an additional a priori parameter will be needed to
determine the gradient of surface S_R. The additional parameter is
needed because edge E_PX provides no constraint between
surfaces S_O and S_R; this situation is analogous to the
discontinuities in the shadow edge discussed previously.

Another circumstance requiring additional a priori parameters is
shown in figure 30. Here, one vertex is not trihedral: there are
four surfaces meeting at that point (S_O, S_P, S_Q, and S_R). This adds
one degree of uncertainty involving the gradients of surfaces S_Q
and S_R; one additional a priori parameter is needed to solve this
problem.


Figure  29:  Additional  Parameter  Needed  for  Hidden  Surface 


Figure  30:  Additional  Parameter  Needed  for  Non-Trihedral  Vertex 

3.4  The  General  Solution  For  Polyhedral  Shadow  Geometry 
The results of the two previous extensions can be directly
combined. In these arguments, it has never been assumed that the
shadow edge and the corresponding shadow-making edge
(E_AX or E_PX) meet at a vertex. Therefore, the results apply without
change to line drawings with additional hidden surfaces, such as
figure  31.  In  this  figure,  there  is  no  strong  information  to  be 
obtained  from  shadow  edge  f^. 


Figure 31: Polyhedron With Additional Invisible Surfaces

Suppose the image depicts i illuminated surfaces and s
self-shadowed surfaces along the shadow-making edges of a
polyhedron, casting a shadow whose corresponding edge
intersects n surfaces of another polyhedron exhibiting d
discontinuities, with h hidden shadow-making surfaces and t
non-trihedral vertices.


• The problem has 2i + 2s + 2n + 2 parameters:

    o 2i for the gradients of the i illuminated surfaces
    o 2s for the gradients of the s self-shadowed surfaces
    o 2n for the gradients of the n background surfaces
    o 2 for the direction of illumination, G_I

• The image provides 2i + 2s + 2n - d - h - t - 1 constraints:

    o 1 from the illumination vector
    o 2 shadow-making/shadow edge pairs used to
      solve the Basic Shadow Problem at one vertex
    o i + s - 2 additional shadow-making edges
    o n - 1 additional shadow edges
    o i + s - h - t - 1 non-redundant edges between visible
      surfaces of the polyhedron casting the shadow
    o 1 non-redundant edge between the shadow-making
      polyhedron and the shaded polyhedron
    o n - d - 1 edges at intersections of visible shaded
      surfaces

• Therefore, 3 + d + h + t parameters must be provided a
priori:

    o 3 for the solution of the Basic Shadow Problem
    o d to compensate for the d discontinuities in the
      shadow edge due to invisible shaded surfaces
    o h to compensate for the h hidden shadow-making
      surfaces
    o t to compensate for the t non-trihedral vertices

Without shadows, the problem contains 2i + 2s + 2n parameters,
the image supplies 2i + 2s + n - d - h - t - 2 constraints, and
n + d + h + t + 2 parameters must be supplied before the
computation.

If i > 1 or s > 1, an additional illumination vector can be used to
determine the exact position of the light source.
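
The bookkeeping above is easy to check mechanically. The sketch below
simply encodes the stated totals; the function names are hypothetical,
and each function returns (parameters, image constraints, parameters
needed a priori).

```python
def counts_with_shadows(i, s, n, d=0, h=0, t=0):
    """Totals for the general polyhedral shadow problem."""
    parameters = 2 * i + 2 * s + 2 * n + 2
    constraints = 2 * i + 2 * s + 2 * n - d - h - t - 1
    return parameters, constraints, parameters - constraints

def counts_without_shadows(i, s, n, d=0, h=0, t=0):
    """Totals for the same line drawing with the shadows removed."""
    parameters = 2 * i + 2 * s + 2 * n
    constraints = 2 * i + 2 * s + n - d - h - t - 2
    return parameters, constraints, parameters - constraints
```

For the simple polyhedron of figure 25 (i = 1, s = 1, n = 1), these
give 8 parameters and 5 constraints with shadows, and 6 parameters
and 3 constraints without, leaving three a priori parameters in both
cases, as stated in section 3.2.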

The  contribution  of  shadows  for  computing  surface  orientations 
from  polyhedral  line  drawings  can  be  summarized: 

• Shadows provide an increasing amount of information
when the shadow edge intersects many visible,
differently oriented surfaces of the background.

•  Shadows  allow  you  to  substitute  one  parameter 
describing  the  direction  of  illumination  to  replace  one 
parameter  describing  a  surface  orientation  before 
performing  the  required  calculations. 

•  Shadows  allow  you  to  substitute  (usually)  highly  visible 
shadow  edges  and  shadow  making  edges  for  many  of 
the  unreliable  edges  within  shaded  portions  of  the 
image,  while  providing  the  same  amount  of  information. 

In addition, when several shadow problems appear in different
portions of the same image, they share some constraints. For
example, suppose several polyhedral blocks are scattered over a
single surface. If the gradient of the surface and the direction of
illumination are known, then three constraints are provided for each
of the shadow problems. This will allow the exact solutions to be
found for all the problems, if no shadow edge discontinuities or
non-trihedral vertices are present.


4.  Shadows  Involving  Curved  Surfaces 

In  this  chapter,  the  involvement  of  curved  surfaces  in  shadow 
geometry  will  be  explored.  Whether  the  curvature  lies  in  the 
occluding  surface  (object)  or  the  shaded  surface,  additional 
information  is  required  to  determine  the  exact  surface  orientation 
along  the  shadow-making  arc  or  the  shadow  edge  arc. 

Witkin  [16]  has  also  used  shadows  to  determine  curved  surface 
orientation.  He  developed  a  relation  between  the  curvature  of  a 
shadow  edge  in  the  scene  and  the  curvature  of  the  shadow  edge  in 
the  image,  then  derived  surface  orientations,  using  surface  texture 
gradients  to  provide  the  additional  constraint  necessary.  The 
discussion  below  differs  from  Witkin's  in  that  surface  orientation 
rather  than  curvature  (rate  of  change  of  orientation)  is  the  basis  of 
the  theory. 

For discussing curved surfaces, it is necessary to generalize the
relation between line labels and surface gradients. Suppose two
(possibly curved) surfaces S_A and S_B intersect along arc E_AB
(Figure 32).


Figure 32: Curved Surfaces Intersecting Along an Arc

The surfaces are defined by

$$S_A: \; -z = f_A(x, y) \qquad S_B: \; -z = f_B(x, y)$$

At a point V_AB on E_AB,

$$-z = f_A(x, y) = f_B(x, y)$$

Differentiating by x and multiplying by Δx,

$$-\Delta x \frac{dz}{dx} = \Delta x \, G_A \cdot \left(1, \frac{dy}{dx}\right) = \Delta x \, G_B \cdot \left(1, \frac{dy}{dx}\right)$$

where G_A and G_B are the gradients of S_A and S_B at V_AB, and E =
(Δx, Δy) is a vector tangent to E_AB at V_AB in the image,
corresponding to the three-dimensional vector (Δx, Δy, Δz) in the
scene.

Since

$$\Delta z = \Delta x \frac{dz}{dx} \quad \text{and} \quad \Delta y = \Delta x \frac{dy}{dx},$$

we have

$$-\Delta z = G_A \cdot (\Delta x, \Delta y) = G_B \cdot (\Delta x, \Delta y), \quad \text{i.e.} \quad -\Delta z = G_A \cdot E = G_B \cdot E$$

This is the curved-surface analogue of the relation -Δz = G · E
described earlier for planar surfaces: the planar-surface edge E is
replaced by the tangent vector E to the arc of intersection of two
curved surfaces. As a consequence, G_A and G_B lie along a line in
gradient space perpendicular to the tangent to the arc of
intersection in the image.
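
A small numerical check can make this relation concrete. The sketch
below uses two surfaces chosen purely for illustration (they do not
come from the paper): a paraboloid -z = x^2 + y^2 for S_A and a plane
-z = 2x + 1 for S_B. Their arc of intersection projects to the image
circle (x - 1)^2 + y^2 = 2, so the image tangent is known in closed
form and G_A · E can be compared with G_B · E directly.

```python
import numpy as np

def grad_a(x, y):
    """Gradient (p, q) of the illustrative surface -z = x^2 + y^2."""
    return np.array([2.0 * x, 2.0 * y])

def grad_b(x, y):
    """Gradient (p, q) of the illustrative plane -z = 2x + 1."""
    return np.array([2.0, 0.0])

def arc_point_and_tangent(theta):
    """Point on the image circle (x-1)^2 + y^2 = 2 and its tangent."""
    r = np.sqrt(2.0)
    point = np.array([1.0 + r * np.cos(theta), r * np.sin(theta)])
    tangent = np.array([-r * np.sin(theta), r * np.cos(theta)])
    return point, tangent

def tangent_projections(theta):
    """Return (G_A . E, G_B . E) at the arc point for this theta."""
    (x, y), tangent = arc_point_and_tangent(theta)
    return grad_a(x, y) @ tangent, grad_b(x, y) @ tangent
```

The two projections agree at every point of the arc, so the difference
G_A - G_B is perpendicular to the image tangent there.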
4.1 Curvature in the Shaded Surface

Suppose a flat surface is casting a shadow on a curved surface,
as in figure 33. Here, vertex V_S12 is the shadow of vertex V_O12.
Surface S_I1, the first illumination surface, casts the shadow of edge
E_O1 on arc E_S1 of the curved surface S_S. Surface S_I2 similarly casts
the shadow of edge E_O2 on arc E_S2.


Figure 33: Shadow Cast On a Curved Surface


Suppose V_SX is an arbitrary point on the arc E_S1. Can we
determine the gradient G_X of S_S at this point?

Arc E_S1 is the arc of intersection between the curved surface S_S
and the illumination surface S_I1 (defined by edge E_O1 of surface
S_O). Therefore, as previously explained, gradients G_X (of S_S at
V_SX) and G_I1 (of S_I1) must lie along a line in gradient space
perpendicular to the tangent line E_X1 to E_S1 at V_SX. This constraint
is illustrated in figure 34.


Figure 34: Gradient Space Constraint Between G_X and G_I1

This reasoning can be used to find the two tangent lines at vertex
V_S12, and use them in a Basic Shadow Problem with edges E_O1 and
E_O2 of the occluding surface S_O. If S_V is the plane tangent to S_S at
V_S12, the Basic Shadow Problem actually involves surfaces S_V and
S_O. For this computation, three a priori parameters will be required,
and the gradients G_O, G_V, G_I1, G_I2, and G_I will be computed.

It is not possible to compute the gradients G_X (and G_Y at other
vertices V_SY as in figure 33, etc.) without additional information.
However, it is possible to establish a one-dimensional constraint on
each such gradient. Since the gradient G_I1 of illumination surface
S_I1 was computed as part of the Basic Shadow Problem at vertex
V_S12, the constraints provided by the tangent lines E_X1 and E_Y1
cause gradient space constraints as shown in figure 35. Similar
reasoning allows constraints on the gradients at points along arc
E_S2 to be computed, using the gradient G_I2 of illumination surface
S_I2.


Figure 35: Gradient Space Constraints On Tangent Planes To S_S


For an investment of three parameters given in advance, then,
the gradients of S_O and S_V can be computed, as well as a
one-dimensional constraint on the gradient for each point along arcs
E_S1 and E_S2. Additional constraint for the gradients along these
arcs might come from another source such as Horn's "shape from
shading" technique [3] or a priori knowledge of the shape of the
object bounded by surface S_S.

In this shadow problem, if another illumination vector is available
(possibly from the shadow of another vertex of S_O), the exact
position of the light source can then be determined.

The  information  available  from  using  shadows  in  this  problem  is 
not  redundant  with  information  available  from  the  same  line 
drawing  without  shadows. 

4.2  Shadows  Cast  By  Curved  Surfaces 

When a curved object casts a shadow on a flat surface as in
figure 36, the shadow edge E_IS corresponds to the shadow of the
"arc of extinction" E_IO which divides surface S_O into an illuminated
part and a self-shadowed part. There exists a curved illumination
surface S_I, composed of illumination vectors, tangent to S_O along
E_IO and intersecting the shaded surface S_S along E_IS. S_I is a
cylinder, whose axis is parallel to the direction of illumination.


Figure  36:  Shadow  Cast  By  a  Curved  Surface 


There is a special significance to the line in the image tangent to
both E_S and the outline of S_O: it is an illumination vector, such as
E_I1 in figure 36. If two such tangent lines are visible (as with E_I1
and E_I2 in figure 36) or some other feature is visible in both E_O
and E_S, then a second illumination vector can be found. From two
illumination vectors, the exact position of the light source can be
computed and the shadow point V_SX can be determined for each
point V_OX on arc E_O.

The surface S_I is composed entirely of illumination vectors; its
gradient at each point must therefore lie along the line L_illum in
gradient space. To determine this line, the value k for the light
source position must be given.

If the light source is not infinitely far away, each illumination
vector such as E_IX has a different value of k and determines a
different line L_illum in gradient space. However, all the values of
k can be computed from the position of the light source, given a
single value of k such as that for E_I1. We will therefore assume,
for simplicity, that the light source is infinitely far away, and
that a single line L_illum exists.

Unfortunately, no stronger statements can be made about the
gradient of S_O from examination of the arc E_O. In particular, the
direction of the tangent line E_OX bears no relationship to the
gradient of S_O. This is illustrated in figure 37, which depicts two
cylinders tangent to the same illumination plane. The arcs of
extinction (dotted lines) have completely unrelated directions in the
image.

(Figure labels: illumination; cylinders with coplanar arcs of extinction)

Figure  37:  Arcs  of  Extinction  Unrelated  To  Surface  Orientation 

However, it is possible to use the shadow edge E_S to compute
the gradients of the tangent surfaces along E_O. The gradient G_X of
S_O at V_OX is the same as the gradient of S_I at V_OX, since S_I is
tangent to S_O at that point. We have two constraints on G_X from
properties of S_I:

1. S_I is an illumination surface, so G_X lies on L_illum.

2. The gradient (G_X) of S_I at V_OX is the same as the
gradient of S_I at V_SX (the shadow of V_OX), since S_I is a
cylinder. As previously shown, G_X and G_S (the gradient
of the shaded surface S_S) must lie along a line in the
gradient space which is perpendicular to the line
tangent to E_S at V_SX.

The  constraints  on  Gx  are  illustrated  in  figure  38. 


Suppose we are given three parameters: k and the gradient G_S
of surface S_S. From these, it is possible to compute the gradient G_X
of the tangent plane to S_O for each point V_OX along the arc of
extinction E_O. Using the definition of k,

    -\Delta z_{IX} = \|E_{IX}\| / k

Since E_IX is contained in S_I,

    -\Delta z_{IX} = G_X \cdot E_{IX}

Also, if E_SX is a vector tangent to E_S at V_SX,

    -\Delta z_{SX} = G_X \cdot E_{SX} = G_S \cdot E_{SX}

Combining these,

    G_X = \begin{bmatrix} E_{IX}^T \\ E_{SX}^T \end{bmatrix}^{-1}
          \begin{bmatrix} \|E_{IX}\| / k \\ G_S \cdot E_{SX} \end{bmatrix}
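
The two constraints above form a 2 x 2 linear system in the components of G_X. The following is a minimal sketch of that solve, not code from the paper: the function name `solve_gradient`, the representation of image vectors as (dx, dy) tuples, and the sample numbers are all assumptions made for illustration, and the sign convention G . E = -Δz is the one used in the equations above.

```python
import math

def solve_gradient(e_ix, e_sx, k, g_s):
    """Solve G_X . E_IX = ||E_IX|| / k and G_X . E_SX = G_S . E_SX for G_X.

    e_ix, e_sx, g_s are 2-D tuples; k is the light source parameter.
    (Hypothetical helper; names are not from the paper.)
    """
    a11, a12 = e_ix
    a21, a22 = e_sx
    b1 = math.hypot(a11, a12) / k          # right-hand side from the illumination vector
    b2 = g_s[0] * a21 + g_s[1] * a22       # right-hand side from the shadow edge tangent
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("E_IX and E_SX are parallel; G_X is not determined")
    # Cramer's rule for the 2 x 2 system
    p = (b1 * a22 - a12 * b2) / det
    q = (a11 * b2 - a21 * b1) / det
    return p, q

# Illustrative numbers only
g_x = solve_gradient(e_ix=(3.0, 4.0), e_sx=(1.0, 0.0), k=2.0, g_s=(0.5, -1.0))
```

The system is singular exactly when the illumination vector and the shadow edge tangent are parallel in the image, in which case the two constraint lines in gradient space do not meet in a single point.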

Figure 39: Using E_O to Calculate the Gradient of S_S


It is also possible to use knowledge about the shape of the
curved object S_O when G_S is not known in advance. Suppose that
two vectors E_OX and E_OY tangent to the arc of extinction E_O at
points V_OX and V_OY are known. Let points V_SX and V_SY be the
shadows of V_OX and V_OY, let E_IX and E_IY be the illumination vectors
joining V_OX to V_SX and V_OY to V_SY, and let E_SX and E_SY be vectors
tangent to the shadow edge E_S at V_SX and V_SY (Figure 38).

If (Δx_OX, Δy_OX, Δz_OX) is the three-dimensional vector
corresponding to E_OX, with similar definitions for the other vectors,
then Δz_OX and Δz_OY are known in advance. As previously shown, if
G_X is the gradient of S_I (and S_O) at V_OX, then

    -\Delta z_{OX} = G_X \cdot E_{OX}

Since E_IX is an illumination vector,

    -\Delta z_{IX} = \|E_{IX}\| / k

and, since E_IX is contained in S_I at V_OX,

    -\Delta z_{IX} = G_X \cdot E_{IX}

Combining,

    G_X = \begin{bmatrix} E_{OX}^T \\ E_{IX}^T \end{bmatrix}^{-1}
          \begin{bmatrix} -\Delta z_{OX} \\ \|E_{IX}\| / k \end{bmatrix}

So, G_X (and, similarly, G_Y, the gradient of S_I at V_OY), can be
determined exactly.

Now, since S_I and S_S intersect at V_SX along E_SX,

    -\Delta z_{SX} = G_X \cdot E_{SX} = G_S \cdot E_{SX}

Similarly,

    -\Delta z_{SY} = G_Y \cdot E_{SY} = G_S \cdot E_{SY}

and therefore, G_S can be determined exactly. Now, G_S can be used
as previously shown to determine the gradient of S_O at each point
on the arc of extinction E_O. Here, knowledge of k and the direction
tangent to E_O at two points has sufficed to determine the gradient
of S_S and the gradient of S_O at all points along E_O.


Figure 40: Gradient Space Constraints From V_OX On G_S


In the special case that S_O is spherical, for example, the entire
arc of extinction E_O lies in a plane S_P whose surface normal is an
illumination vector. Therefore, the gradient G_P = G_I. In this case,
the entire problem can be solved with only one parameter (k) given
in advance, since Δz_OX and Δz_OY can be calculated directly:

    \Delta z_{OX} = -E_{OX} \cdot G_I = \frac{k \, (E_{OX} \cdot E_I)}{\|E_I\|}

and

    \Delta z_{OY} = \frac{k \, (E_{OY} \cdot E_I)}{\|E_I\|}

The shadow information just described is not redundant with
information available in the same line drawings when no shadows
are present.


5.  Shadow Geometry and Other Shape Inference Techniques

Shadow geometry can be combined with other techniques for
determining 3D interpretations from images.

5.1  Other Gradient Space Techniques

In [13], the closed form solution for the Basic Shadow Problem is
presented in the form:

    G_O = f(G_S, k)

When k is given in advance, G_O is shown to be an affine transform
(two-dimensional linear transform) of G_S.


Stated in this form, it is very convenient to use shadow geometry
in conjunction with other techniques for determining surface
gradients. For example, in figure 41, a line drawing is shown in
which the intensities of the surfaces are known. If the surfaces are
Lambertian or have known reflectance functions, Horn's "shape
from shading" technique [3] can be used to determine a contour in
gradient space along which G_S must lie, and a similar contour for
G_O. Now, if the contour for G_S is transformed in its entirety by the
function f provided by shadow geometry (as discussed above), a
new contour for G_O is provided in gradient space (figure 42). Since
G_O must lie along two contours, it must lie at one of the points of
intersection of these contours. Now, for each such point, the
corresponding point G_S can be determined using the inverse of
transform f.
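
The contour-intersection step described above can be sketched numerically. This is only an illustration under stated assumptions: the affine map `f`, its inverse `f_inv`, the sampled contour for G_S, and the implicit contour `circle` for G_O are all hypothetical stand-ins (the paper's actual transform comes from [13] and depends on k). The sketch transforms a sampled contour and looks for sign changes of the second contour's implicit function along it.

```python
def crossings(points, g):
    """Return points where the polyline `points` crosses the contour g(p, q) = 0."""
    hits = []
    for (p0, q0), (p1, q1) in zip(points, points[1:]):
        g0, g1 = g(p0, q0), g(p1, q1)
        if g0 * g1 < 0:                      # strict sign change between samples
            t = g0 / (g0 - g1)               # linear interpolation of the zero
            hits.append((p0 + t * (p1 - p0), q0 + t * (q1 - q0)))
    return hits

# Hypothetical affine map f(G) = G + (1, 0) and its inverse.
def f(g):
    return (g[0] + 1.0, g[1])

def f_inv(g):
    return (g[0] - 1.0, g[1])

# Hypothetical contours: G_S sampled along q = 0, G_O on the unit circle.
contour_s = [(-3.0 + 0.4 * i, 0.0) for i in range(11)]

def circle(p, q):
    return p * p + q * q - 1.0

candidates_o = crossings([f(g) for g in contour_s], circle)   # possible G_O values
candidates_s = [f_inv(g) for g in candidates_o]               # corresponding G_S values
```

Each entry of `candidates_o` is one possible value of G_O, and mapping it back through the inverse transform gives the matching G_S. The accuracy of the crossings is limited by the sampling density of the contour.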


Figure 41: Shape From Shading


Figure  42:  Shadow  Geometry  and  Shape  From  Shading 

Shadow geometry can similarly be combined with Kanade and
Kender's "skewed symmetry" [6], as in figure 43. Here, skewed
symmetry provides a hyperbolic contour for each of the two surface
gradients G_O and G_S; shadow geometry can be used to transform
the contour for G_S into an additional contour for G_O. The points of
intersection of the contours for G_O are then the possible values of
G_O, and the corresponding values of G_S can be found as above.

5.2  Shape Recovery for Curved Surfaces

Some techniques have appeared in the literature for
reconstructing the orientation of a curved surface at every point,
using relaxation techniques [1, 5]. These techniques typically
begin with the surface orientation at every point along the outline of
the surface (S_O in figure 44). These values form a boundary
condition which drives the relaxation process.


Figure  44:  Shadow  Geometry  and  Curved  Surface  Recovery 

In this paper, we have seen that it is possible to determine the
surface orientation for the tangent planes at each point along the
arc of extinction using three a priori parameters (such as the k
value for the light source and the orientation of the surface on
which the shadow appears). These values can be used to provide
stronger boundary conditions for relaxation techniques.

Surface orientations along the arc of extinction are valuable for
another reason. Relaxation techniques must make some
presumptions about the curvature of the surface (e.g. surface of
minimum curvature, cubic or other surface of revolution). Since all
of these models of curvature are consistent with the tangent
gradients along the outline of S_O, it is not possible to decide which
model is appropriate when the only boundary condition comes from
the outline of S_O. However, when the arc of extinction is also used,
it may be possible to select one from several possible models of
surface curvature, or to measure systematic deviation from a
particular model for a specific object.

6.  Conclusions 

This paper has presented a theory describing relationships
among surface orientations in line drawings with shadows. The
relationships arise from hypothesizing the existence of
"illumination surfaces" connecting shadow edge pairs, assigning
appropriate line labels to shadow and shadow-making edges, and
applying the resulting constraints in the gradient space.

This technique falls short of providing exact solutions to shadow
geometry problems. The line drawing must be augmented with
information such as the orientations or curvature of specific
surfaces or the position of the light source if exact surface
orientations are to be found.

It has been shown, however, that shadow geometry provides
important benefits for image understanding:

•  Shadows allow you to substitute information about the
light source position for a priori knowledge
about surface orientations.

•  Shadows allow you to determine geometric information
from highly visible shadow edge pairs instead of using
many of the unreliable edges within shaded portions of
an image.

•  An  increasing  amount  of  information  is  provided  by  the 
shadow  edge  when  the  shadow  falls  on  many  visible, 
differently  oriented  surfaces. 


•  Shadows  provide  some  constraint  when  curved 
surfaces  are  involved. 

•  Shadows  provide  constraint  between  surfaces  even 
when  they  do  not  touch  in  the  scene  (or  image). 

•  Shadows  allow  the  solution  to  one  shadow  problem  to 
be  used  in  the  solution  of  other  shadow  problems, 
since  typical  shadow  problems  are  mutually 
constrained  (e.g.  same  light  source,  same  background 
surface). 

In addition, some observations have been made about the
solution of the correspondence problem for shadows, which must
be solved before surface orientations can be inferred.


ABSTRACT

We present two methods to solve the problem of
matching linear features extracted from an aerial
image with a set of linear features derived from a
map or another view of the same scene. The first
method, using discrete relaxation, is very efficient
but requires the model to have a small number of
elements. The other one, called the kernel method,
demonstrates how drastically we can improve the
running time if we know a few pairs of matched
elements. Illustrative examples are provided, and
extensions are discussed.

1.  INTRODUCTION 


Suppose that we are given a very high
resolution aerial picture taken from a known
altitude and with a known orientation, together with
a detailed map of the area. How can we determine
which parts of the picture correspond to given
elements of the map? The complexity of this problem
stems partly from the fact that a picture is
described in terms of pixel intensities while the
map is a set of high level abstract entities. There
have been several tentative answers to this
question. Early systems worked directly with the
intensity array [1,2,3], trying to find
transformations which map one array into another.
Problems arise when the illumination changes
substantially or when the texture changes with the
seasons.

Price and Faugeras [4,5] extracted linear
features and regions to be matched with a map whose
characteristics are derived manually. They use
relative position constraints and stochastic
labeling.

The  Hughes  Research  Laboratories  [6,7,8] 
conducted  studies  to  match  two  views  of  a  scene 
using  line  and  vertex  features  derived  from  the 
scene. 

We first extract features from the intensity
images using the USC linear feature extraction
system [9]. The technique consists of convolving
the image with 6 directional edge masks, each 5 x 5
pixels, choosing the maxima, thinning and
thresholding the convolved output, linking the
resulting edges based on proximity and orientation,
and finally approximating by straight lines.

The two linear features considered are
SEGMENTS and APARS. Segments are linear
approximations of a set of connected edge points.
They are defined by their endpoints, orientation,
length and average contrast. Apars are a
representation of two segments separated by a small
gap and with orientations differing by 180°. Apars
are very appropriate to represent roads and rivers.
If the scene is to be matched with a map, we encode
the linear pieces of the map manually. The problem
can now be formulated as: Which elements in the
image correspond to the given elements in the map,
based on geometrical constraints?

The next section provides assumptions and
definitions, the third section describes the
relaxation method, the fourth describes the derived
kernel method, the fifth presents results and the
conclusion outlines possible extensions.

2.  ASSUMPTIONS  AND  DEFINITIONS 

We  assume  that 

1.  The  model  and  the  scene  have  approximately  the 
same  orientation,  since  the  orientation  of  the  plane 
at  the  time  of  the  picture  is  known. 

2.  The scaling factor from the model to the
scene, μ, is known, since the camera
characteristics and the altitude of the plane are
known.

Let us define the following terms: We will denote
the linear features of one image as a_i, 1 ≤ i ≤ n, and
call them objects.

We will denote the linear features of the other
image, or of the map, as λ_j, 1 ≤ j ≤ m, and call them
labels.

The set A = {a_i | 1 ≤ i ≤ n} is the scene.

The set L = {λ_j | 1 ≤ j ≤ m} is the model.

We are interested in computing the quantity
p(i,j) which is the possibility for object a_i to
have label λ_j. p(i,j) is a discrete variable that
can be either 0 or 1; it is NOT a probability
because

1.  object a_i may have no label (Σ_j p(i,j) = 0).

2.  several objects may have the same label
(Σ_i p(i,j) > 1).

3.  object a_i may have several labels (Σ_j p(i,j) > 1).

p^0(i,j) represents the initial assignment.

p^t(i,j) represents the assignment at the t-th
iteration.



The methods presented here principally rely on
geometrical constraints, meaning that when we assign
a label λ_j to an object a_i, we expect to find an
object a_h with a label λ_k in a certain position
depending on i, j, k. This area is denoted w(i,j,k)
and is called the window (i,j,k).

Let us represent any object a_i by a vector
A_iB_i, and any label λ_j by a vector P_jQ_j. We know
where any other label λ_k (P_kQ_k) is in the model,
relative to λ_j: it is uniquely defined by P_jP_k for
example. Having this information, it should be
simple to find the corresponding zone in the scene,
as illustrated in the example of figure 1.

Let us consider object a_i (A_iB_i) with assigned
label λ_j. To define w(i,j,k), we "slide" λ_j
over a_i to obtain two extreme positions:

1.  Identifying A_i with P_j we define the 2 points
R_1, S_1 by

a)  OR_1 = OA_i + μ P_jP_k

b)  OS_1 = OR_1 + μ P_kQ_k

2.  Identifying B_i with Q_j we define the 2 points
R_2, S_2 by

a)  OR_2 = OB_i + μ Q_jP_k

b)  OS_2 = OR_2 + μ P_kQ_k

These 4 points R_1, S_1, R_2, S_2 are the 4 corners
of a window w(i,j,k) in which we should find an
object a_h with the label λ_k.
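
The window construction can be sketched in a few lines. This is an illustrative reading of the four corner formulas, not the authors' code: the function name `window`, the tuple representation of points, and the sample coordinates are assumptions, and `mu` stands for the model-to-scene scaling factor.

```python
def sub(a, b):
    """Vector from point b to point a."""
    return (a[0] - b[0], a[1] - b[1])

def add_scaled(origin, v, s):
    """Point origin + s * v."""
    return (origin[0] + s * v[0], origin[1] + s * v[1])

def window(obj, label_j, label_k, mu):
    """Corners (R1, S1, R2, S2) of the window w(i, j, k).

    obj = (A_i, B_i), label_j = (P_j, Q_j), label_k = (P_k, Q_k);
    mu is the model-to-scene scaling factor. (Hypothetical helper.)
    """
    a_i, b_i = obj
    p_j, q_j = label_j
    p_k, q_k = label_k
    pk_qk = sub(q_k, p_k)
    r1 = add_scaled(a_i, sub(p_k, p_j), mu)   # A_i identified with P_j
    s1 = add_scaled(r1, pk_qk, mu)
    r2 = add_scaled(b_i, sub(p_k, q_j), mu)   # B_i identified with Q_j
    s2 = add_scaled(r2, pk_qk, mu)
    return r1, s1, r2, s2

# Illustrative coordinates: the object is twice as long as its label.
corners = window(obj=((0.0, 0.0), (4.0, 0.0)),
                 label_j=((0.0, 0.0), (2.0, 0.0)),
                 label_k=((1.0, 1.0), (1.0, 3.0)),
                 mu=1.0)
```

In this example the object is longer than its label, so sliding the label from one end of the object to the other sweeps the expected position of the second label across a rectangle two units wide.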

Finally, we need to define the relation C, "is
compatible with", between (i,j) and (h,k) as

(i,j) IS COMPATIBLE WITH (h,k)

<==> (i,j) C (h,k)

<==> a_h in w(i,j,k) AND a_i in w(h,k,j).

We need to check both predicates because the
relation "is in w" is not symmetric, that is a_h in
w(i,j,k) does not imply a_i in w(h,k,j).

We  now  can  proceed  to  explain  how  the  methods 
operate. 

3.  DESCRIPTION OF THE RELAXATION METHOD

This method takes a scene and a model as input.
We first assign possibilities based on comparable
angular orientation of an object a_i and a label λ_j;
then, for each such pair (i,j), we verify that, for
any other label λ_k, we find an object a_h with this
label in the window w(i,j,k) where we expected to
find one. Some tolerance is accepted to take into
account inaccuracy and partial matches. If the
evidence is sufficient, we keep the pair (i,j);
otherwise we discard it. We iterate until we reach
a stable configuration, that is until we reach an
iteration where no pair is discarded. Figure 2
shows a flowchart of the procedure.

Let us now explain this process in formal
terms.


Given a set of objects A = {a_i | 1 ≤ i ≤ n}
(scene) and a set of labels L = {λ_j | 1 ≤ j ≤ m}
(model), we are looking for a subset M = {(a_i, λ_j) |
p(i,j) = 1} of the cartesian product A x L which is the
set of objects matched with a label.

Let M^t be the superset of M at the t-th
iteration: M^t = {(a_i, λ_j) | p^t(i,j) = 1}.

1.  Initial assignment of possibilities

Let θ_i be the angular orientation of a_i, let φ_j be
the angular orientation of λ_j. Then, for all (i,j),
p^0(i,j) = 1 iff |θ_i - φ_j| ≤ 15°.
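
A minimal sketch of this initial assignment follows. The function name `initial_assignment`, the list-of-lists layout of the possibility matrix, and the sample angles are assumptions for illustration. The rule is applied literally as stated, so angle wraparound (for example 179° versus -179°) is not handled.

```python
def initial_assignment(object_angles, label_angles, tol_deg=15.0):
    """Return p0 as an n x m 0/1 matrix: p0[i][j] = 1 iff |theta_i - phi_j| <= tol_deg.

    Angles are in degrees. (Hypothetical helper; no wraparound handling.)
    """
    return [[1 if abs(theta - phi) <= tol_deg else 0 for phi in label_angles]
            for theta in object_angles]

# Three objects, two labels (illustrative angles only)
p0 = initial_assignment([10.0, 95.0, 170.0], [0.0, 90.0])
```

Note that the third object receives no label at all, which is one of the reasons given earlier for p(i,j) not being a probability.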

2.  Iteration formula

For all (i,j) we define p^{t+1}(i,j) = 1 iff p^t(i,j) = 1
and ∀ k in [1,m], ∃ h in [1,n] such that (i,j) is
compatible with (h,k), that is a_h is in w(i,j,k) and
a_i is in w(h,k,j).

3.  Core of the method

If there exists only a partial match, then M = ∅.
This situation also occurs if some objects are
slightly out of place. Since we are interested in
partial matches, we introduce the quantity q and
modify the iteration formula as follows: p^{t+1}(i,j) =
1 iff p^t(i,j) = 1 and there exists a subset S of [1,m]
with q elements such that ∀ s in S, ∃ k in [1,n] such
that (i,j) is compatible with (k,s).

q is a measure of the way scene and model agree.
Setting q to m means that we know that there is a
perfect match "a priori." We will denote the
resulting set at the t-th iteration as M_q^t. The
stopping criterion is simply M_q^{t+1} = M_q^t. The main
result is that this process converges in a finite
number of iterations. For more detailed analysis,
see [10,11].
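
The iteration with the threshold q can be sketched as below. Several things here are assumptions rather than the authors' implementation: the function name `relax`, the callback signature `compatible(i, j, h, s)`, and in particular the requirement that a supporting pair (h, s) must itself still be possible (`p[h][s] == 1`), which is one reading of "an object with this label". The toy compatibility rule is purely illustrative.

```python
def relax(p, compatible, q):
    """Discrete relaxation: drop pairs supported by fewer than q labels until stable.

    p          : n x m 0/1 matrix of possibilities (initial assignment)
    compatible : compatible(i, j, h, s) -> bool, hypothetical predicate
    q          : number of labels that must support a pair
    """
    n, m = len(p), len(p[0])
    while True:
        new_p = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if not p[i][j]:
                    continue
                # Count labels s for which some still-possible object h supports (i, j).
                support = sum(
                    1 for s in range(m)
                    if any(p[h][s] and compatible(i, j, h, s) for h in range(n))
                )
                if support >= q:
                    new_p[i][j] = 1
        if new_p == p:          # stopping criterion: no pair was discarded
            return p
        p = new_p

# Toy rule (illustrative only): "diagonal" pairs support each other.
def toy_compatible(i, j, h, s):
    return i == j and h == s

result = relax([[1, 1], [0, 1], [0, 0]], toy_compatible, q=2)
```

Because pairs are only ever removed, the loop must terminate after a finite number of passes, which matches the convergence claim above.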

Some  examples  of  successful  application  are  shown  in 
the  section  describing  results. 

4.  DESCRIPTION  OF  THE  KERNEL  METHOD 

toe  of  the  main  problems  with  the  method 
described  above  is  that  the  nunber  of  labels  in  the 
model  has  to  be  small  for  the  method  to  be 
efficient.  That  leads  us  to  selecting  a  few  labels 
with  no  really  valid  criterion.  Let  us  see  how  we 
can  improve  tie  method  if  we  know  fbr  sure  that  some 
pairs  (i,j)  are  in  the  set  M  . 

Let B be a subset of A with q elements, B = {b_i |
1 ≤ i ≤ q} and B ⊂ A.

Let T be a subset of L with q elements, T = {t_i |
1 ≤ i ≤ q} and T ⊂ L,

such that all pairs [(b_i,t_i), (b_j,t_j)] are pairwise
compatible.

Obviously, all these couples are in M_q; let us call
the set of all such couples K_q. Now, a sufficient
condition for any pair (a_i, λ_j) to be in M_q is simply
that

either (a_i, λ_j) = (b_k, t_k) for some k in [1..q]
or ∀ k in [1..q], (a_i, λ_j) is compatible with (b_k, t_k).

Let us denote this newly obtained set by N_q
= {(a_i, λ_j) | (a_i, λ_j) in K_q or ∀ k in [1..q],
(a_i, λ_j) is compatible with (b_k, t_k)}.
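
The construction of N_q amounts to a single pass over the candidate pairs. In this sketch the function name `kernel_filter`, the representation of a pair as an (object index, label index) tuple, and the toy compatibility rule are all assumptions for illustration.

```python
def kernel_filter(candidates, kernel, compatible):
    """Keep a pair if it is in the kernel or is compatible with every kernel pair.

    candidates : list of (object, label) pairs to test
    kernel     : list of q trusted (object, label) pairs
    compatible : compatible(pair_a, pair_b) -> bool, hypothetical predicate
    """
    kernel_set = set(kernel)
    return [pair for pair in candidates
            if pair in kernel_set or all(compatible(pair, kp) for kp in kernel)]

# Toy rule (illustrative only): pairs are compatible when they share the same offset.
def same_offset(a, b):
    return a[0] - a[1] == b[0] - b[1]

n_q = kernel_filter([(0, 0), (2, 2), (2, 3), (1, 1)],
                    kernel=[(0, 0), (1, 1)],
                    compatible=same_offset)
```

Each candidate is tested against the q kernel pairs only, so the cost grows linearly with the number of candidates rather than with repeated relaxation passes over the whole model.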

We can see that M ⊆ N_q ⊆ M_q; therefore, the
new set is better than the previous one.

Finding  the  Kernel 

One of the problems encountered by the
relaxation method is that, having to reduce the
number of labels, we had to choose some with no
valid criterion. Now, to find a valid kernel, we
consider the full model and choose a small number of
objects in the scene, that is, reverse the role of
scene and model. The difference is that all labels
look alike, but we can choose objects that have
distinctive attributes, i.e. they are long,
isolated, and correspond to strong edges. The
procedure is then:

1.  Choose N objects from the scene.

2.  Match them with the model using the relaxation
method with q < N.

3.  Find q matched pairs that verify pairwise
compatibility. These q pairs are the kernel.

A flow chart of the procedure is shown in fig. 3.

It  is  worth  noting  that  finding  the  kernel  as 
described  here  is  equivalent  to  finding  a  complete 
subgraph  of  a  certain  size  in  a  graph,  and  that  this 
problem  is  NP-complete.  However,  the  complexity  of 
the  problem  does  not  interfere  with  the  process 
since  we  control  the  input  size. 
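
Since the input is kept small, the search for q pairwise-compatible pairs can be done by brute force. The sketch below is one straightforward way to do that, not the authors' procedure: the name `find_kernel`, the pair representation, and the toy compatibility rule are assumptions. It checks the relation in both directions because "is in w" is not symmetric.

```python
from itertools import combinations

def find_kernel(pairs, compatible, q):
    """Return the first q matched pairs that are pairwise compatible, or None.

    Exhaustive search over all q-subsets (a clique search), so it is only
    practical when `pairs` is small. `compatible` is a hypothetical predicate.
    """
    for subset in combinations(pairs, q):
        if all(compatible(a, b) and compatible(b, a)
               for a, b in combinations(subset, 2)):
            return list(subset)
    return None

# Toy rule (illustrative only): pairs are compatible when they share the same offset.
def same_offset(a, b):
    return a[0] - a[1] == b[0] - b[1]

matched = [(0, 0), (0, 1), (1, 1), (2, 2)]
kernel = find_kernel(matched, same_offset, q=3)
```

The number of subsets examined grows combinatorially with the number of matched pairs, which is why the procedure restricts itself to a few distinctive objects.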

Discussion 

There  are  some  very  interesting  properties 
associated  with  this  method: 


1.  We  no  longer  need  to  limit  the  size  of  the 
model. 

2.  It  is  a  one  pass  method  giving  a  fast  yes/no 
answer  for  each  object  in  the  scene. 

3.  The  map  can  be  replaced  by  another  scene,  such 
as  a  different  view  of  the  same  scene. 

4.  Since the method to find the kernel is fast, we
can "forget" that we know the relative orientation
of scene and model and derive it or refine it.

5.  RESULTS 

These methods have been applied to 2 scenes
representing part of the Fort Belvoir Military
Reservation in Virginia. The original pictures have
been provided by the Defense Mapping Agency and the
full resolution images are 2048 x 2048. Figure 4a
shows the first view, taken in August, at full
resolution. Figure 4b shows the second view, taken
in November, at full resolution. Figure 5 is the
part of the map corresponding to the previous
images. As we can see, the original images are very
detailed, and in order to segment them, we proceed
hierarchically: To find the most prominent features
such as large roads and rivers, we use lower
resolution images, as shown on figures 6a and 6b
that have a resolution of 256 x 256. Now, as
explained in the introduction, we extract the edges,
thin them and link them to obtain the linear
features shown on figures 7a and 7b. As we can see,
most small details have vanished. Since we are
interested in roads and rivers, we extract the apars
(antiparallel lines) with a maximum width of 8
pixels and filter out the very small ones. The
resulting scenes are shown in figures 8a and 8b.

5.1  Relaxation  Method 

To illustrate the relaxation method, we
manually generate from the map a model of the main
highway, as shown in figure 9, and match it against
the scene of figure 8a. The result is shown in
figure 10 and is the desired result.

Once the prominent features are identified in
the low resolution image, we can compute better
estimates of the scale and orientation and
concentrate on the details of the full resolution
image, as shown in fig. 11: For each object in the
map, such as buildings, we can define a small window
in which this object should be, if effectively
present. This is illustrated in fig. 12:
fig. 12 (a) shows the segments inside the window
where 2 buildings should be,
fig. 12 (b) is the hand made model of these
buildings, derived from the map. Their shape is
deliberately inaccurate to show that the method is
not sensitive to such small variations,
fig. 12 (c) exhibits all the segments from fig. 12
(a) that are matched with the model of fig. 12 (b).

5.2.  Kernel  Method 

We will use the images at low resolution and
apars as primitives. To illustrate the efficiency
of the kernel method, we first provide the model
from the map as shown in figure 13. This is a full
model and contains 35 labels. We use a kernel of 4
elements to match it with the first scene (fig. 8
(a)). The resulting set of matches is shown in
figure 14. The processing time was 8.5 seconds, not
counting time to compute the kernel. To compare its
performance with the relaxation method, we matched
the same scene with the full model and a value q =
9. The result, shown in figure 15, took 750
seconds, and contains more errors.

We also generated a kernel from the first scene
to match the first view with the second view,
considered as the model. The second view is a
rather complex scene because the original image is
very textured, and contains many objects;
furthermore, some long segments are broken into
small pieces. However, the method was successful,
as shown in figure 16.

6.  CONCLUSION 

This paper demonstrates how a small quantity of
"a priori" knowledge can transform a hard problem
into a simple one. The "expensive" processing,
namely relaxation, is used to find a good match on a
small subset of a scene, thus allowing the decision
for the other elements of the scene to be simple and
fast. This method can be generalized to work on all
elements of an image that can be modeled in terms of
vectors in a 2-d space; however, applying it to all
the edges of the full resolution image (2048 x 2048)
does not appear to be a natural way to proceed.
We are therefore currently investigating the
existence and representation of intermediate
primitive features with a higher semantic meaning
than segments.

After the original match at low resolution, we
encode the map in terms of these new primitives and
match this model with the scene, and only then do we
match each primitive with corresponding segments.

Figure 1.  Example of window design.


Figure 2.  Flow chart of the relaxation method.


Figure  3.  Flow  chart  of  the  kernel  method. 

(a)  August


(b)  November 

Figure  4.  Full  resolution  view  (2048  x  2048) 


(a) August 


(b)  November 


Figure 6.  Low resolution view (256 x 256)


Figure 13.  Full model from the map


Figure  14.  Apars  of  Fig.  8(a)  matched 
with  the  model  of  Fig.  13 
using  kernel  method. 


Figure  16.  Matching  8(b)  with  8(a)  as 
the  model. 


Figure  15.  Apars  of  Fig.  8(a)  matched 
with  the  model  of  Fig.  13 
using  relaxation  method. 

This paper describes a new approach to representing and
reasoning with temporal information. A wide variety of temporal
specifications can be converted into linear inequalities
relating the endpoints of the events. Linear programming is
then used to represent these constraints and perform deductions.
The information is modularised into semantically related
clusters of events, each with its own tableau and related
to each other by a reference frame transformation. This
provides a uniform, formally adequate representation which
is complete and also computationally efficient.

Introduction 

Most work in artificial intelligence which deals with real
world problems would require some reasoning with time.
The importance of such a temporal understanding in the
areas of problem solving and natural language understanding
has been recognized earlier [1, 5]. In image understanding
applications too, it is necessary to know about sequences
of events in order to identify a dynamically changing scene.

Most problem solving systems have modelled time using
a state-space approach. In this approach the world is
described as a sequence of snapshots each with a set of facts
holding at the time instant. Because of the inadequacy of
this approach, attempts have been made to incorporate time
explicitly in planning [3,6,5]. In particular Vere's DEVISER
[5] is a general purpose planner which generates parallel
plans to achieve goals with imposed time constraints. Both
durations and start time windows for sets of goal conditions
may be specified. The parallel plans consist of not just
actions but also of events (triggered by circumstances),
inferences, and scheduled external events (completely beyond
the actor's control).

However, Vere's system is not a general purpose temporal
reasoning system. Time is incorporated into the planner
with the notion of a window, which is typically an upper
and lower bound on the time when an activity may occur.
Windows are specified explicitly for goals and are computed
dynamically during plan generation by considerations
of durations of intervening activities and the times of
occurrence of scheduled external events. This is done by
window revision algorithms which push activities forward or
backward on the time-line when they get ordered or when
the durations become known. The temporal representation
and reasoning is ad hoc and tied to the needs of the planner.

A more general purpose approach is that taken in the
systems [1, 2, 4] that build time specialists. Such a subsystem
maintains temporal relations and provides the rest of
the system with tools to store, retrieve, delete and reason
with the temporal information. There are two major
requirements for a time specialist: first, it must be formally
adequate, and second, it must be computationally effective.
The first condition is met if the formal system is coherent
and consistent, and contains sufficient mechanisms to be
able to represent all temporal specifications and perform all
the deductions we want. The second requirement is essential
in order to have the program produce answers with a
reasonable amount of effort. This paper describes another
attempt in this direction.

In  the  next  section  we  develop  the  notion  of  temporal 
specification.  The  following  section  describes  the  repre¬ 
sentation  scheme.  Lastly  we  describe  the  deductions  that 
can  be  performed  by  the  reasoning  module. 

Temporal  specifications 

A temporal specification [4] is a statement that partially
specifies, in some manner, the time of one or more events.
Examples are:

(1)  The  gas  leak  started  immediately  after  takeoff. 

(2)  John  saw  Mary  a  while  ago. 

(3)  My  fever  lasted  3  days. 

(4)  A  few  days  back,  I  was  in  Las  Vegas. 

(5)  Hiroshima  was  bombed  on  August  6,  1945. 

(6)  I  will  finish  my  PhD  in  two  or  three  years.

(7)  Jack  had  an  accident  a  month  after  getting  to 
Boston. 

The  next  three  subsections  attempt  to  clarify  different 
aspects  of  this.  However,  it  may  be  observed  that  except  for 
(2)  and  possibly  (4)  all  the  specifications  are  linear  relations 
between  the  endpoints  of  events. 

1.  Modelling  uncertainty. 

As the examples indicate, temporal information is often
incomplete and imprecise. However, one can draw a distinction
between two kinds of incompleteness.

(a) Underconstrained in the mathematical sense.
When we say that event B occurs after event A, the associated
information is expressed in the inequality start_B > end_A.
There are an infinite number of pairs of values
for start_B and end_A that satisfy this inequality. However,
the mathematical meaning of this statement is absolutely
precise.

(b) Linguistic fuzziness. When we say that "Jane saw the
doctor a few days ago" the problem is that of translating
from the natural language phrase to an assertion in the
formal representation used. Several approaches [7] can be
taken:

1. Uncertainty intervals: Some cases of vagueness
can be thought of as a value at an indeterminate point in a
well-defined interval. The phrase "last year" means a point
somewhere in the interval 12:01 am January 1, 1981 to
12:00 midnight, December 31, 1981. The phrase "a few
days ago" can be defined (arbitrarily) as between 2 and 10
days.

2. Plausibility distributions: It may be better to
think of an imprecise term as being more or less applicable
over a range of values. The phrase "a few days ago" would
be very applicable exactly 3 days ago, moderately applicable
7 days ago, and highly inapplicable 1 year ago. The degree
of applicability can be quantified by using probabilities and
manipulations done using fuzzy theory.

3. Linguistic variables: Instead of converting imprecise
phrases to numerical measures, one leaves them
in a symbolic form and uses ad hoc rules to manipulate
them. This approach is used in Kahn and Gorry [4],
who would translate a sentence "John will finish his
thesis in a few months" into a time-expression of the
form (AFTER (ALL-OF TODAY (FUZZY-AMOUNT
(NIL A-FEW MONTHS). The structure for a FUZZY-AMOUNT
includes a quantifier (e.g. ABOUT, NEARLY)
and a fuzzy-number (e.g. A-FEW, SEVERAL).

In our system, the uncertainty interval approach is
used. No attempt is made to translate phrases of the type
"a few days ago", as it is believed to be primarily a problem
of natural language understanding.
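
The uncertainty interval idea can be illustrated with a small sketch. The class name `UncertaintyInterval`, the helper `ago`, and the choice of days as the unit are assumptions made for this illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertaintyInterval:
    """A time point known only to lie somewhere in [lo, hi] (units: days)."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("empty interval")

    def contains(self, t):
        return self.lo <= t <= self.hi


def ago(now, min_days, max_days):
    """Interval for a phrase such as 'between min_days and max_days ago'."""
    return UncertaintyInterval(now - max_days, now - min_days)


# "A few days ago", arbitrarily defined as between 2 and 10 days before day 100.
visit = ago(now=100, min_days=2, max_days=10)
```

The two endpoints of such an interval are simply an upper and a lower bound on one time variable, so they fit the linear-constraint representation without special treatment.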

2.  Time  Points  vs  Time  Intervals.

Temporal information can be given both as relations
between endpoints and between intervals. A fact like
"The bank opens at 9:00 am" is a statement about the
startpoint of the event bank-is-open. A fact like "The
Cuban missile crisis took place during Kennedy's
term" is a statement about intervals. The two approaches
are equivalent in terms of representational power. Any statement
about intervals can be converted to an equivalent statement
about the endpoints of the associated events, e.g.

    A before B  ⇔  end_A < start_B

If  one  is  using  an  interval-oriented  approach,  endpoint 
information  can  be  represented  by  associating  with  each 
event  two  instantaneous  events — one  corresponding  to  the 
start  of  the  parent  event  and  the  other  to  the  end. 

For  the  convenience  of  the  user,  both  forms  of 
specification  are  allowed.  The  internal  representation  is  in 
terms  of  endpoints  as  that  is  more  natural  for  our  repre¬ 
sentational  scheme. 
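
As a hedged illustration of the conversion from interval statements to endpoint statements, the sketch below emits endpoint relations as plain tuples. The function names and the tuple encoding are hypothetical, and the reading of "during" as non-strict containment is an assumption, since the text does not define it.

```python
def start(event):
    return f"start_{event}"


def end(event):
    return f"end_{event}"


def before(a, b):
    """A before B  <=>  end_A < start_B."""
    return [(end(a), "<", start(b))]


def during(a, b):
    """A during B, read here (an assumption) as B containing A."""
    return [(start(b), "<=", start(a)), (end(a), "<=", end(b))]


def at(point, time):
    """A statement about a single endpoint, e.g. an opening time."""
    return [(point, "=", time)]


constraints = (
    before("takeoff", "gas_leak")
    + during("missile_crisis", "kennedy_term")
    + at(start("bank_is_open"), 9.0)
)
```

Every relation produced this way is linear in the endpoint variables, which is what the internal endpoint representation relies on.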

3.  Absolute  vs  Relative:  Changing  Reference  frames.

The everyday notion of time is relative rather than absolute.
We remember events in semantically related clusters
revolving around a key event. Even time as measured by the
Gregorian calendar is relative rather than absolute. This
suggests an organisation for the temporal information:
semantically related clusters, each of which represents
information about the events in its own reference frame. As in
the case of spatial information, it then becomes necessary to
provide for transformations to relate the time coordinates
of events in different reference frames. The transformation
is linear.
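
A linear reference frame transformation can be sketched as a scale and an offset. The class `FrameTransform` and the example frames below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameTransform:
    """Maps a time t in one reference frame to scale * t + offset in another."""
    scale: float
    offset: float

    def apply(self, t):
        return self.scale * t + self.offset

    def inverse(self):
        return FrameTransform(1 / self.scale, -self.offset / self.scale)

    def then(self, other):
        """Compose: apply self first, then other."""
        return FrameTransform(other.scale * self.scale,
                              other.scale * self.offset + other.offset)


# Hypothetical cluster: day 0 of a trip corresponds to day 200 of the year.
trip_to_year = FrameTransform(scale=1.0, offset=200.0)
# Hypothetical change of unit: days of the year to hours of the year.
year_to_hours = FrameTransform(scale=24.0, offset=0.0)
trip_to_hours = trip_to_year.then(year_to_hours)
```

Because the composition of two linear maps is again linear, a constraint stated in one cluster remains a linear constraint after it is carried into another.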

The  representation 

As may have been observed, all the temporal
specifications are equivalent to linear relations between the
endpoints of the events. This means that we can use linear
programming to represent and reason with temporal
information. A time specialist based on linear programming is
guaranteed to be formally adequate, unlike the ad hoc
methods. It provides a uniform representation for storing
the wide variety of temporal information. This is a
major change from the philosophy of earlier systems. Kahn
and Gorry [4] use several different ways of organizing the
events: with a date-line, using before/after chains, and using
special reference events, each with a separate procedure
for making deductions. Allen [1] uses a network of constraints
to maintain all possible relationships about how the
intervals in it are related. However, in his system no metric
information is represented, and thus it fails to be formally
adequate by our criteria.

To people conditioned to react with horror to uniform
formally adequate schemes, our representation would
immediately raise the specter of inefficiency. Indeed, this
would be so if all the temporal facts about the domain were
to be represented in the same linear programming tableau.
Recall, however, that we can organize the information in
semantically related clusters, each with only a small number
of constraints. The system still remains complete because
we can do a reference frame transformation to relate
events in different clusters. This idea buys us the same
advantage as the reference interval concept [1, 4] in a more
systematic way. The analogy with the way we organise and
reason with spatial information suggests the naturalness of
this approach.

Now for the details. The linear programming is done
using the simplex algorithm in the version formulated by
Tucker. In this approach the rows of the tableau have a
direct physical meaning: they correspond to the endpoints
of the events. The system was written in Maclisp in the
ACRONYM environment so that it could be used easily as a
module for future image understanding work.

Temporal  Reasoning 

As the tableau represents all the information in the
temporal specifications, the system is complete: all deductions
that can be made from the constraints can be made
from the tableau. The general approach is to formulate
an expression which is maximised or minimised, while still
satisfying the constraints in the tableau. The implemented
features include:

1. Satisfiability: As the linear constraints associated
with each temporal specification are entered into
the tableau, the existence of a feasible solution is checked.
The system refuses to accept a constraint that is inconsistent
with the previous set.

2. Bounds: One can determine the upper and lower
bounds for any variable, which corresponds to an endpoint
of an event, or a linear expression in these variables. For
example, this permits us to find upper and lower bounds on
the duration of an event.

3. Possibility and Necessity: If a predicate's being
true would not be inconsistent with the constraints in the
tableau, the predicate is said to be possible. If a predicate's
being false would be inconsistent with the constraints of the
tableau, the predicate is said to be necessary. These deductions
would be useful to a planner using this time specialist.
If event A is necessarily after B, then that ordering can be
done right away. Possibility considerations can help prevent
unnecessary backtracking.
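
The three deductions can be illustrated with a minimal sketch. Note the substitution: instead of the simplex tableau used in the system, this sketch handles only the restricted family of difference constraints x - y <= c and computes their shortest-path closure (Floyd-Warshall), which is a standard technique for that family. The class `TimeSpecialist` and its method names are hypothetical, and strict inequalities are not modelled.

```python
import math


class TimeSpecialist:
    """Difference constraints x - y <= c over named time points (a sketch)."""

    def __init__(self):
        self.names = []   # time-point names, e.g. "start_A"
        self.edges = []   # (x, y, c) index triples meaning x - y <= c

    def _index(self, name):
        if name not in self.names:
            self.names.append(name)
        return self.names.index(name)

    def _closure(self, edges):
        # d[y][x] is the tightest derivable upper bound on x - y.
        n = len(self.names)
        d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
        for x, y, c in edges:
            d[y][x] = min(d[y][x], c)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    d[i][j] = min(d[i][j], d[i][k] + d[k][j])
        # A negative diagonal entry means x - x < 0 was derived: inconsistent.
        return d, all(d[i][i] >= 0 for i in range(n))

    def add(self, x, y, c):
        """Satisfiability: accept x - y <= c only if the set stays consistent."""
        edge = (self._index(x), self._index(y), c)
        _, consistent = self._closure(self.edges + [edge])
        if consistent:
            self.edges.append(edge)
        return consistent

    def bounds(self, x, y):
        """Bounds: (lower, upper) on the linear expression x - y."""
        i, j = self._index(x), self._index(y)
        d, _ = self._closure(self.edges)
        return -d[i][j], d[j][i]

    def possible(self, x, y, c):
        """x - y <= c holds in at least one solution."""
        return self.bounds(x, y)[0] <= c

    def necessary(self, x, y, c):
        """x - y <= c holds in every solution."""
        return self.bounds(x, y)[1] <= c


ts = TimeSpecialist()
ts.add("end_A", "start_A", 3)     # A lasts at most 3 ...
ts.add("start_A", "end_A", -3)    # ... and at least 3
ts.add("start_B", "end_A", 5)     # B starts at most 5 after A ends ...
ts.add("end_A", "start_B", -2)    # ... and at least 2 after A ends
```

With these four constraints, `ts.bounds("start_B", "start_A")` gives the interval within which B can start relative to the start of A, and `ts.necessary("start_A", "start_B", 0)` asks whether A necessarily starts no later than B.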

Abstract


The  Hough  transform  provides  a  paradigm  for  multi¬ 
dimensional  pattern  detection  and  more  general  parameter 
extraction.  It  has  good  properties  in  noise,  but  as  usually 
implemented  requires  an  accumulator  array  whose  size  is 
exponential  in  the  number  of  parameters  extracted.  The 
idea  of  accumulating  votes  in  a  small  content-addressable 
store  raises  many  technical  issues,  some  of  which  are 
outlined  here.  Simulation  results  illustrate  some  of  the 


The preparation of this paper was supported in part by the
Office of Naval Research under Grants N00014-80-C0197
and N00014 82-K0193, and in part by the Defense
Advanced Research Projects Agency under Grant N00014
78-C0164.


1.  Introduction 

The Hough Transform (HT) maps a point of one
parameter space A (often a local feature space or
generalized image) into a set of compatible points of
another space B (often a space of "higher-level," more
global parameters) [Hough 1962; Duda and Hart 1972;
Ballard 1981]. This process is sometimes called "voting"
(the feature points vote for possible higher-level constructs
of which they could be a part).

HT is traditionally implemented by accumulating
votes in a quantized version of the parameter space B,
called an accumulator array. A naive implementation of
the accumulator array uses space exponential in the
number of dimensions (parameters); this makes
multiparameter HT impractical. One approach to this problem
is to quantize the parameter space dynamically [Sloan
1981; O'Rourke 1981]. Another idea, the subject of this
report, is the cache-based HT, which uses a fixed (small)
content-addressable store (in software, a hash table; in
hardware, a cache) to accumulate votes.
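
The contrast between a dense accumulator array and a content-addressable store can be sketched as follows. This is an illustration under stated assumptions, not the implementation studied in this report: it uses straight-line detection in (theta, rho) space as the example transform, the function names are hypothetical, and the hash table here is allowed to grow, whereas the cache-based HT fixes its length and must flush.

```python
import math
from collections import Counter


def dense_cells(bins_per_axis, n_params):
    """Size of a naive accumulator array: exponential in the parameter count."""
    return bins_per_axis ** n_params


def hough_lines(points, n_theta):
    """Accumulate line votes in a hash table keyed by (theta index, rho bin)."""
    votes = Counter()
    for x, y in points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(k, rho)] += 1   # only cells that receive votes are stored
    return votes


points = [(0, 0), (1, 1), (2, 2), (3, 3)]   # four feature points on y = x
votes = hough_lines(points, n_theta=8)
peak, count = votes.most_common(1)[0]
```

Only cells that actually receive votes occupy storage, so the table size is bounded by the number of votes emitted rather than by the size of the quantized parameter space.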

To study the properties of the cache-based HT, it is
helpful to have an abstract, domain-free characterization of
the HT vote-generation process. Existing applications (e.g.,
[Brown and Curtiss 1982]) and theory (e.g., [Shapiro 1975,
1978a, 1978b; Brown 1982]) suggest considerations that
should be taken into account by an abstraction. The
abstraction may be analyzed or simulated with various
caching schemes to derive a model of cache-based HT
performance. Then, existing applications (such as our
shape, motion, and illuminant-direction computations)
may be used to test model predictions.

Some  considerations  for  a  general  model  of  the  cache- 
based  HT  process  appear  in  Sections  2  and  3,  and  Section 
4  gives  an  abstraction  of  the  voting  process.  Section  5 
outlines  some  possible  analytical  approaches.  Section  6 
presents  results  of  pilot  studies  on  cache  behavior.  Figure  1 
is  a  diagram  of  the  system  for  analysis  of  sequential 
behavior  of  HT  schemes. 

«Figure  1» 

2.  Informal  Abstract  Hough  Transform 

This section lists considerations that govern the
behavior of any HT-like process emitting votes through
time into an accumulator. Such an abstract process is
called a voting machine in Section 4.

In a noiseless HT, one feature point in A produces a
set H of votes that generally are weighted points in the
parameter space B (so a vote is a vector (w, b), with w a
weight and b in B). (We assume here that the HT is
"tuned" to one parameter transformation, to detect
instances of one shape, etc.) In one limiting case, the set H
is unbounded (as when a point on a line votes for an
infinite line of points in (slope, intercept) space). In
another limiting case, H is a singleton (as in the work of
Shapiro [1975, 1978a, 1978b]). In either case, for noiseless
HT we assume that H from a feature point in A contains
exactly one vote for the "true" parameters in B that
characterize the instance in A. Let us call the true
parameter value the peak or mode for the instance, since
that is where the plurality (mode) of votes for it should be.
H is the feature point spread function (fpsf) of [Brown
1982]. In all that follows, we assume accumulation into a
quantized version of B, so H = fpsf is a discrete set. Of
course H may be a complicated function of the parameter
values and locations in the originating parameter space A;
H = fpsf = fpsf(x, A(x)). As A is scanned, individual
feature points cause votes to be emitted. The summation
of all H from an instance into the accumulator array is the
parameter point spread function (ppsf).

(Consideration 2.1) Instances in an image consist of
several feature points, extend across several scan lines, and
are spatially coherent. There may be more than one
feature on a scan line. Instances may overlap.

(Consideration 2.2) The image can be scanned in any
of several ways: e.g., left-right, top-bottom; pixels at
random (with and without replacement); left-right across
scan lines in random order; left-right across interlaced scan
lines.

(Consideration  2.3)  Votes  come  in  bursts  of  length 
||fpsf(x,  A(x))||  =  ||H||,  each  generated  by  a  single  feature 
vector  in  A. 

(Fact) With one instance and no noise, successive
bursts all come from the single instance. With multiple
instances and no noise, successive bursts may or may not
come from the same instance. The likelihood that they do
depends on assumptions about the distribution of instances
in the image (do they overlap?), about the distribution of
feature points in an instance (is there more than one per
scan line?), and about the scanning pattern. (The basic
question is whether two features of the same instance are
encountered before another instance.)

(Consideration 2.4) Only one (weighted) vote in H is a
vote for the true instance parameters (for the peak or
mode).

(Consideration 2.5) A basic question in HT is its
performance in noise. H's from noise points are
interleaved with H's from instances. There are several
noise phenomena with differing characteristics and
implications. First, there is statistical noise. One form is
signal-independent image degradation. This process
establishes a kind of baseline, since its random nature
makes it one of the least pernicious forms of noise in the
cache-based HT. One component of degradation, perhaps
worth singling out, is that component consisting of features
not associated with any instance in A. This component
adds extraneous H's, but does not affect H's from true
instances.

Another form of noise is errors in computing "image
features," i.e., the feature vectors in A. These errors, as
well as location and quantization noise in A, may be
modeled as certain random processes.

The most pernicious form of noise, and one not
treated here, is systematically misleading instance
information. For instance, extreme and systematic feature
dropout can arise from occlusion. Especially troublesome
are near-miss instances (many features in common with
instances of interest). The problem is that in the
cache-based HT, it is important to create and retain peaks for the
instances. These peaks are threatened by competition from
other growing peaks, such as arise in near-miss instances.

3.  Informal  Abstract  Cache 

A hash table or cache accepts one element at a time, is
content-addressable, and is garbage-collected, or flushed,
when its finite length is exhausted. Unlike a cache
for memory management, the HT cache is not meant to
track a dynamic situation, but to function as a "smart
accumulator."

Our cache is a data structure and associated operations
for management of a set S from the space of weighted
vectors from B. Thus elt = (w, b) for some w ∈ Reals (or
w ∈ Integers) and b ∈ B. A cache has a parameter called its
length: the number of elements it can hold. As do most
abstract data types dealing with sets, a cache has basic
initialization and housekeeping operations, as well as
operations of the following flavor (these are for exposition,
not practicality).

Full(cache): TRUE if cache full, else FALSE.

FindVector(searchelt, cache, @foundelt): sets foundelt
to unique element whose vector agrees with that of
searchelt and returns TRUE, or returns FALSE if
no such element is in cache.

FindVote(weight, cache, @foundeltSet): sets
foundeltSet to set of elements whose weight agrees
with (or is less than or equal to) weight and returns
TRUE, or returns FALSE if no such element is in
cache.

FindMax(cache, @foundelt): sets foundelt to the
element (or one of the elements) with maximum
weight.

Remove(elt, cache): removes elt from cache.

Insert(elt, cache): adds elt (not in cache) to non-Full
cache (see Replace).

AddWeight(elt, cache): does nothing but return
FALSE if no element matching elt's vector is in
cache. Otherwise it increments the matching
vector's weight by the weight of elt and returns
TRUE.

There  are  three  high  level  cache  management 
operations:  Vote,  Flush,  and  Interpret. 

Vote(elt,  cache):  Vote  uses  AddWeight  to  increment 
the  weight  (vote  count)  of  an  existing  element,  or 
else  Inserts  element  elt  into  the  cache  if  there  is 
room,  or  as  a  last  resort  invokes  a  more  or  less 
complicated  method  of  dealing  with  a  full  cache, 
usually  involving  Flush  and  possibly  re-trying  the 
Insert. 

Flush(cache): Flush creates room for more
elements. Its design is a focus of current research
(Section 6). Algorithms with hardware
implementation are especially of interest.

Interpret(cache): Interpret is to identify "peaks"
in accumulator space B, which it is hoped
correspond to (are the parameters of) instances in
the input space A. If the input contains only a
single instance, the natural Interpret returns the
element of maximum weight (most votes). For
multiple instances, especially an unknown number
of them, Interpret presents interesting problems,
and is a focus of current research. As with Flush,
algorithms with a natural hardware implementation
are especially of interest.
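
The operations above can be sketched as a small class. This is an illustrative reading, not the simulated cache of Section 6: the class name `VoteCache` and its method names are hypothetical, elements are stored as a dictionary from vector b to weight w, and the Flush shown (discard every entry with the fewest votes) is just one simple choice among the strategies the text leaves open.

```python
class VoteCache:
    """Fixed-length content-addressable vote store (a sketch)."""

    def __init__(self, length):
        self.length = length   # number of elements the cache can hold (>= 1)
        self.weights = {}      # vector b (hashable) -> accumulated weight w

    def full(self):
        return len(self.weights) >= self.length

    def add_weight(self, b, w):
        """Increment an existing element; report whether it was present."""
        if b not in self.weights:
            return False
        self.weights[b] += w
        return True

    def insert(self, b, w):
        assert not self.full() and b not in self.weights
        self.weights[b] = w

    def remove(self, b):
        del self.weights[b]

    def find_max(self):
        return max(self.weights.items(), key=lambda item: item[1])

    def flush(self):
        """Assumed strategy: drop all entries with the fewest votes."""
        lowest = min(self.weights.values())
        for b in [b for b, w in self.weights.items() if w == lowest]:
            self.remove(b)

    def vote(self, b, w=1):
        """AddWeight if present, else Insert, flushing first if full."""
        if self.add_weight(b, w):
            return
        if self.full():
            self.flush()
        self.insert(b, w)

    def interpret(self):
        """Single-instance Interpret: the element of maximum weight."""
        return self.find_max()
```

Note that with this Flush a peak can be lost if it is tied for the lowest weight at the moment the cache overflows, which is one reason flushing strategy matters.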

(Consideration 3.1) The Cache has a fixed length. If
that is infinite, the cache acts like the accumulator array.

(Consideration 3.2) The Cache is often most easily
addressed as a linear data structure, with vote vectors
transformed to offset integers. Addressing modes may vary
in hardware implementations.

(Consideration  3.3)  There  must  be  a  cache-flushing, 
garbage-collection  strategy  for  dealing  with  overflow.  The 
obvious  basic  strategy  is  to  flush  entries  with  fewest  votes. 
There  are  many  variations,  including  using  recency 
information  (traditional  for  LRU  caches)  and  geometric 
locality  information  (implemented  through  a  hierarchical 
cache  arrangement). 

(Remark) For applications, flushing strategy may
reflect considerations of hardware implementation.

(Remark)  The  interaction  of  CHough  [Brown  1982, 
Brown  and  Curtiss  1982]  with  cache-based  HT  raises 
interesting  questions,  since  it  involves  negative  votes. 

4.  A  General  Voting  Machine 

A  geometric  imaging-like  situation  is  an  easy  way  to 
describe  the  voting  machine,  and  in  fact  the  most 
straightforward  implementation  probably  is  as  shape 
detection  applied  to  an  image-like  array.  A  description  as 
some  form  of  stochastic  FSM  would  work  equally  well, 
but  isn’t  as  visualizable. 

The  abstract  image  is  a  two-dimensional  array  A(x)  of 
vector  entries  corresponding  to  instances  and  to  noise.  The 
voting  machine  scans  this  array,  and  for  each  feature 
vector  it  encounters  it  emits  an  H  =  fpsf(x,A(x)). 
Houghing  for  a  global  parameter  (like  sun  angle)  is 
modeled  by  an  A(x)  with  only  one  instance.  Houghing  for 
multiple  instances  is  possible.  Noise  (Consideration  2.5) 
may  be  modeled. 

The following technical decisions arose from geometric
encodings of the considerations in Section 2 and from
[Brown 1982].

1) Let an instance be one or more vertical lines (called
parts) in a horizontal row, separated by empty columns.
This captures all facets of Consideration 2.1.

2)  Various  forms  of  scanning  (Consideration  2.2)  are 
implemented  by  accessing  A  in  different  orders. 

3) The fpsfs for an instance should look like diameters
of a circle (Figure 2). The HT of an instance (the contents
of the accumulator array) will then be a radial rosette of
lines all passing through a central point, the peak. This
decision is based on Considerations 2.3 and 2.4, and the
fact that this choice reproduces the ubiquitous near-peak
1/r falloff of parameter psfs in all dimensions [Brown
1982].

«Figure  2» 

4) Noise can take various forms (Consideration 2.5).
Missing instance points can, if systematic, lead to
non-radially symmetric ppsfs. Noise feature vectors can be
added into the A(x), and of course instance feature vectors
can be perturbed by noise processes. In the experiments
reported here, we had no massive dropout (modeling
occlusion), getting a similar effect by varying the number
of parts per instance. We did not attempt to address
questions of near-miss instances at this stage. For this
report we used the baseline case of signal-independent
image degradation.

The  variables  that  characterize  the  abstract  voting 
machine  are: 

V1)  ||fpsf||

V2)  number  of  instances  and  their  location 
V3)  number  of  parts/instance 
V4)  noise  parameters  of  three  types 
V5)  scanning  method 

Figure  3  shows  an  A(x)  with  two  2-part  instances  with  and 
without  noise.  Each  part  is  9  features  long;  there  are 
eighteen  different  features. 
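
A much-simplified sketch of the vote-generation side may help fix ideas. It skips the abstract image A(x) entirely and emits, for a single noiseless instance, one burst per feature point; each burst is a rasterized diameter through the peak, at an angle that is assumed here to be spread evenly over the features. The function names, the rasterization, and the angle assignment are all assumptions for illustration.

```python
import math
from collections import Counter


def diameter_fpsf(peak, angle, length):
    """Cells of a rasterized diameter of `length` cells centred on `peak`."""
    px, py = peak
    r = (length - 1) // 2
    c, s = math.cos(angle), math.sin(angle)
    cells = []
    for t in range(-r, r + 1):
        if abs(c) >= abs(s):     # step along x so the cells are distinct
            cells.append((px + t, py + round(t * s / c)))
        else:                    # step along y
            cells.append((px + round(t * c / s), py + t))
    return cells


def voting_machine(peak, n_features, fpsf_length):
    """Yield one burst of votes per feature point of a single instance."""
    for k in range(n_features):
        yield diameter_fpsf(peak, math.pi * k / n_features, fpsf_length)


accumulator = Counter()
bursts = list(voting_machine(peak=(10, 10), n_features=11, fpsf_length=9))
for burst in bursts:
    accumulator.update(burst)
```

Each burst contains exactly one vote for the peak, so the accumulated rosette has its maximum at the peak cell, with the vote count there equal to the number of features.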

«Figure  3» 

5.  Analytic  Approaches 

1) Hypothesis Testing: After the HT, the accumulator
holds a distribution of votes. After caching the same data,
the cache is meant to mirror certain characteristics of the
accumulator's distribution. In particular, it is meant to
have the same modes (local maxima). One could proceed
with something like a statistical hypothesis testing
approach, with the null hypothesis being that modes in the
cache correspond to modes in the accumulator. The idea
of course is to measure the probability of Type I and Type
II errors (falsely rejecting and falsely accepting the null
hypothesis), given some sample statistic: in our case, the
cache contents. Usually, we choose probabilities α and β
for these errors that are small enough for our purposes.
When things go well (i.e., in textbooks) this approach
yields a critical (rejection) region of values of the test
statistic for which we should reject the null hypothesis if
we are to maintain our standards for error probabilities.
There are non-parametric tests for things like the sameness
of median between two distributions.

What goes wrong immediately here is that the
sampling performed by the cache is extremely structured.
Even if we were to attempt to compute with the known
relevant distributions [Brown 1982], the intricacies of this
sampling seem very resistant to analysis.

A  generalization  of  the  hypothesis  testing  approach  is  a 
decision  theory  approach,  that  allows  a  more  flexible  cost 
function  to  be  associated  with  decisions.  The  appealing 
thing  there  is  perhaps  that  it  can  more  respectably  be 
based  on  empirical  data.  Sequential  decision  theory  seems 
possibly  more  relevant,  though  less  well  understood. 

2) Page Replacement, etc.: There is a long tradition of
work (e.g., the analysis of page replacement) aimed at
characterizing cache behavior. Almost certainly none of
that work is directly relevant, but it may provide useful
analytical tools and abstractions.

3) Geometry and Probability: Last (this seems most
promising), the particular abstraction chosen to model
cache-based HT might be analyzed with very
straightforward probability and statistical tools. The model
proposed in Section 4 is very geometrical in flavor,
corresponding to throwing line segments sequentially
down on the plane and analyzing their accumulating
statistics of overlap. It may well be that basic,
geometrically based probability theory is directly relevant.

6.  Results 
6.1  Overview 

We  implemented  the  abstract  voting  machine  of 
Section  4  and  a  software  simulation  of  a  cache  that  is 
parameterizable  in  several  ways,  including  length  and 
flushing  strategy.  The  cache  is  instrumented  for  various 
forms  of  statistical  analysis.  The  results  are  presented  in 
tables  and  figures  that  follow  this  section.  They  are 
preceded  first  by  a  high  level  overview  of  the  experiments, 
then  a  more  detailed  explanation  of  the  parameters 
involved. 

We  ran  a  set  of  experiments  to  explore  several 
parameters  (both  of  the  cache  and  of  the  abstract  image) 
governing  sequential  HT  performance. 

Fixed Parameters:    Flush Algorithm (Threshold)
                     Number of Instances (1)
                     Part length (11)

Varying Parameters:  Cache length (32, 64, 128, ∞)
                     Number of Parts (1, 2, 3)
                     Noise (0, 15%, 30%)
                     Fpsf Length (3, 9, 15 cells)
                     Scanning method (five methods)

The results of these experiments are summarized in
Tables 1-4, which show a "peak/background ratio"
measure and its variation for each case. (Tables 1, 3, and 4
appear in [Brown and Sher 1982].) The ratio is computed
as follows.

$$\text{Ratio} = \frac{\text{weight of vector describing instance}}{\text{maximum weight of all other vectors}} \qquad (1)$$

«Table 2»

Thus a ratio of 3.00 means that the single parameter
vector correctly describing the instance is three times as
high as its nearest competitor. A ratio of unity means there
is a tie. Any ratio greater than unity means the correct
peak would be found by simply choosing the highest peak.
A ratio less than unity means such simple thresholding
finds a vector giving an incorrect instance description.
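
Equation (1) is straightforward to compute from the final cache or accumulator contents. In this sketch the contents are assumed to be a dictionary from parameter vector to weight, and the function name is hypothetical; returning infinity when there are no competing votes is a convention chosen here, not one stated in the text.

```python
import math


def peak_background_ratio(weights, true_vector):
    """Weight of the true vector over the largest weight of any other vector."""
    peak = weights.get(true_vector, 0)   # 0 if the true peak was flushed
    background = max(
        (w for b, w in weights.items() if b != true_vector), default=0
    )
    if background == 0:
        return math.inf                  # no competing votes at all
    return peak / background


detected = peak_background_ratio({"p": 3, "a": 1}, "p")
missed = peak_background_ratio({"p": 1, "a": 2}, "p")
```

A value above 1 corresponds to a case in which the simple-threshold Interpret would pick the correct vector, and a value below 1 to a case in which it would not.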

We  expanded  a  subset  of  these  results  in  more  detail. 


Fixed Parameters:    Flush Algorithm (as above)
                     Number of Instances (as above)
                     Part length (as above)
                     Number of parts (1)

Varying Parameters:  Cache length (as above)
                     Noise (15%, 30%)
                     Fpsf length (as above)
                     Scanning method (two methods)

The  expanded  results  are  shown  in  Figure  4,  which 
gives  the  histograms  of  the  peak/background  ratio  whose 
means  and  standard  deviations  appear  in  Tables  1-4. 
From  the  histograms  it  can  be  seen,  for  instance,  what 
fraction  of  the  time  an  instance  would  not  be  detected  by  a 
simple  thresholding  version  of  the  Interpret  function. 

«Figure 4»

Finally,  Figure  5  shows  histograms  like  those  of  Figure 
4.  Here  the  main  focus  is  on  the  Flush  algorithm  as  cache 
length  and  noise  level  vary. 

«Figure  5» 

6.2  Details 

The  choice  of  which  parameters  to  vary  and  interesting 
values  for  them  was  determined  by  a  significant  amount  of 
prior  experiment.  The  size  of  the  abstract  image  is 
irrelevant,  being  redundant  with  other  parameters  such  as 
noise density. In fact the image was 20 x 20, and the
single  instance  was  located  in  its  center. 

Flush Algorithms: We used Threshold Flush
exclusively in the main experiment. It simply raises a
threshold for the vote count (weight) of elements until it
finds elements whose weight is at threshold, all of which it
flushes. For the last experiment, we compared Threshold
Flush with Random Below Threshold Flush, which simply
finds an element at random whose weight is in the bottom
nth percentile of weights and removes that single element.
In this experiment the doomed element was picked at
random from the bottom third of weighted elements.
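
The two flush strategies can be sketched as follows. This is a minimal illustration under stated assumptions: the cache is modelled as a dictionary from parameter vector to weight, a flush is triggered only when a new vector arrives at a full cache, and the function names and the `capacity` parameter are hypothetical.

```python
import random


def threshold_flush(cache):
    """Flush every element whose weight equals the lowest weight present."""
    threshold = min(cache.values())
    for vec in [v for v, w in cache.items() if w == threshold]:
        del cache[vec]


def random_below_threshold_flush(cache, fraction=1 / 3, rng=random):
    """Flush one element picked at random from the lowest-weight fraction."""
    ranked = sorted(cache, key=cache.get)  # lightest first
    pool = ranked[:max(1, int(len(ranked) * fraction))]
    del cache[rng.choice(pool)]


def vote(cache, vec, capacity, flush):
    """Add one vote for vec, flushing first if a new entry would overflow."""
    if vec not in cache and len(cache) >= capacity:
        flush(cache)
    cache[vec] = cache.get(vec, 0) + 1
```

Threshold Flush removes all of the lightest entries at once, so a newly arrived vector with a single vote is always among the first to go; the random variant removes only one entry and may spare it.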

Number  of  Instances:  Only  one  instance  appears  in 
images  for  these  experiments.  This  is  an  important 
case,  and  fairly  terse  statistics  describe  the  efficacy 
of  the  simple-threshold  Interpret  algorithm. 
Multiple  instance  performance  is  a  matter  for 
future  research  (also  see  (Brown  and  Curtiss  1982)). 

Number  of  Parts:  One-,  two-,  and  three-part  instances 
generate  (in  the  noiseless  case)  that  number  of 
feature points per horizontal scanline.

Part length: Part length 11 means the instance
extends across 11 horizontal scan lines. The total
"signal strength" for the instances is 11, 22, and 33
votes.

Cache length (32, 64, 128, infinite): The infinite cache
is a pure accumulator array. This variable can to
some extent be traded off against Fpsf Length. It is
worth mentioning that commercially available
hardware has cache lengths on the order of
thousands of words.

Noise  (0%,  15%,  30%):  These  experiments  use  a  noise 
model  that  mimics  signal-independent  image 
degradation.  A  random  feature  vector  overwrites 
an  abstract  image  point  (instance  or  background) 
with  the  indicated  probability. 

Feature point spread function (fpsf) Length (3, 9, 15
cells): Short fpsfs produce less inherent noise
(fewer sidelobes) [Brown 1982]. For example,
consider line detection: As better and better edge
gradient information is incorporated to reduce the
possible range of slope attributed to the line, the
fpsf shrinks. Inherent noise is not a problem in
single-instance images, but the effects of the
statistical Noise are also felt at greater distance
with larger fpsfs, and that accounts for the
performance degradations noticeable here. 9-cell
fpsfs were used in Fig. 2b.

Scanning  method:  We  use  five  methods  to  scan  the 
abstract  image.  The  first  four  hit  each  image  point 
exactly once; the last is not guaranteed to. The first
two  are  deterministic,  the  last  three  are  random. 
The  first,  fourth,  and  fifth  might  be  easily 
implemented  in  hardware. 

1) Left to Right within Top to Bottom: as in a
non-interlaced  TV  scan. 

2)  Top  to  Bottom  within  Left  to  Right:  interesting 
because  it  causes  several  fpsfs  from  the 
instance  to  arrive  almost  successively. 

3)  Random  without  replacement:  implemented  by 
accessing  the  400  elements  of  the  array  in 
permuted  order. 

4) Left to Right within randomly permuted
scanlines. 

5) Random with replacement: Access n^2 elements
of the n x n image array at random, very likely
missing some entirely and consequently hitting
others more than once.
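
The five scan orders can be sketched as below. This is an illustrative reconstruction only: the function name `scan_order`, the (row, column) point convention, and the use of Python's `random` module are assumptions.

```python
import random


def scan_order(method, n=20, rng=random):
    """Return the sequence of (row, col) points visited by scan method 1-5."""
    rows, cols = list(range(n)), list(range(n))
    if method == 1:  # Left to Right within Top to Bottom
        return [(r, c) for r in rows for c in cols]
    if method == 2:  # Top to Bottom within Left to Right
        return [(r, c) for c in cols for r in rows]
    if method == 3:  # Random without replacement: a permutation of all points
        points = [(r, c) for r in rows for c in cols]
        rng.shuffle(points)
        return points
    if method == 4:  # Left to Right within randomly permuted scanlines
        rng.shuffle(rows)
        return [(r, c) for r in rows for c in cols]
    if method == 5:  # Random with replacement: n*n independent draws
        return [(rng.randrange(n), rng.randrange(n)) for _ in range(n * n)]
    raise ValueError("method must be 1..5")
```

Methods 1 to 4 visit each of the n x n points exactly once; method 5 makes the same number of accesses but gives no such guarantee.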

N:  There  are  two  sources  of  randomness:  One  arises 
from  noise,  one  from  the  random  scanning 
methods.  We  ran  ten  different  noisy  images  into 
each  scanning  method.  The  random  methods  each 
produced  ten  different  orders  of  scan,  the  scanline 
methods  each  produced  one,  which  was  repeated 
ten  times.  Thus  in  the  finite  cache  experiments, 
the  N  for  the  two  non-random  scanning  methods 
(1  and  2)  is  10,  and  for  the  three  random  scanning 
methods  (3,  4,  5)  is  100.  In  the  infinite  cache,  the 
four methods (1,2,3,4) that hit each element once
are  equivalent,  so  are  collapsed  into  one  case. 

6.3  Conclusions  and  Future  Work 

How  should  these  results  be  interpreted?  To  apply 
them  to  any  particular  HT  situation  (but  so  far  only  to 
detection  of  a  single  instance  or  derivation  of  a  parameter 
vector  for  a  single  phenomenon,  and  only  with  signal- 
independent  noise),  analyze  it  in  terms  of  scanning 
algorithm, number of feature points, noise level, fpsf
length and so forth. Interpolate or extrapolate as necessary.
Remember  that  the  important  basic  process  is  the 
succession  of  votes  emerging  from  the  HT  (its  mix  of 
feature  and  noise  elements).  The  explicit  conversion  from 
any  given  HT  situation  to  parameters  of  our  model  will  be 
addressed  in  more  detail  in  a  future  report  (or  a  later 
edition  of  this  report). 

The  infinite  cache  (a  pure  accumulator  array)  is  not 
dramatically  better  than  the  finite  caches  under  the 
circumstances  of  these  experiments.  Thus  finite  caches 
seem to be practical HT accumulators.

The  tables  and  histograms  show  a  graceful  and 
qualitatively  predictable  degradation  of  performance  as 
noise  and  fpsf  length  increase  and  cache  length  decreases. 

The  random  (with  replacement)  scan  was  uniformly 
significantly  worse  in  all  cases  than  other  methods.  The 
random  (without  replacement)  does  not  seem  significantly 
better.  In  fact,  the  results  are  (perhaps  surprisingly) 
insensitive  to  scanning  method,  as  long  as  each  input  space 
point  is  hit  once. 

Even in 30% noise, the Top to Bottom within Left to
Right scanning method was not better than the Left to
Right within Top to Bottom. This indicates that successive
bursts of votes (fpsfs) containing votes for the "true
parameter" are not enough to help the peak establish
itself. Probably the noise that arrives before the instance
and the flush algorithm dominate the peak establishment
process.

Several of the histograms show a significant bimodality
with one peak at 0. This indicates that the "true peak"
never gets established in a significant number of cases, and
that some preprocessing might help dramatically in
establishing peaks. One idea is "pair Houghing," in which
only vectors that appear in two fpsfs are allowed into the
cache. Clearly without such preprocessing the issue is
whether peaks are able to survive in the cache, and this
depends to a large extent on the Flush algorithm. Instances
found early in the scan might have an advantage in some
flushing strategies, while other strategies might penalize
peaks that have been in the cache longer. We reasoned
that the simple Threshold Flush might be systematically
flushing nascent peaks before they got established, thus
accounting for the significant number of 0
peak/background ratios.

Jerry  Feldman  suggested  the  Random  Below 
Threshold  Flush  to  replace  the  systematic  "slaughter  of 
innocents"  imposed  by  Threshold  Flushing.  The  result  is  a 
lottery-like  system  that  selects  from  a  range  of  vote 
weights,  allowing  some  small  peaks  to  survive.  As  shown  in 
Figure  5,  this  method  makes  for  a  significant  improvement 
in  mean  peak/background  ratio,  and  reduces  the  peak  at  0. 

Systematic distortion, near-miss instances, and
occlusion pose thorny problems for the cache-based HT.
We emphasize that many of the conclusions of this report
and all the quantitative data are based on a statistical,
signal-independent noise model. An important topic for
future work is to assess the performance of cache HT
schemes in these more stringent noise conditions and to
develop means of improving such performance.

In CHough methods [Brown 1982, Brown and Curtiss
1982], the fpsf contains negative as well as positive votes.
CHough interacts in an interesting way with flushing
strategies. Future work will address these issues.

Hierarchical  caches  can  incorporate  a  form  of 
"geometric  locality"  into  flushing.  Natural  flush  algorithms 
for a single-level cache are based strictly on weights.
Multiple  caches  in  a  resolution  pyramid,  maintained  in 
parallel,  can  be  used  to  restrict  flushing  to  coherent 
localities  in  the  input  space  A.  This  idea  is  another  matter 
for future research.

Figure 1: The cache-based Hough transform analysis
system.

Figure 2(a-b): (a) Each straight line is a (continuous)
feature point spread function (fpsf) showing the effect
(votes) in space B of the features in space A encoded
as the associated integers. (b) In a digital
implementation, thirteen 9-cell fpsfs are superimposed
to give a parameter point spread function (ppsf), the
image in parameter space of a 13-feature instance. The
peak corresponding to the instance is in the center.

Figure 3(a-b): (a) Two 18-feature, two-part instances in a
20x20 abstract image. (b) The instances of (a) with
15% signal-independent noise.


(? marks a digit that is illegible in the source.)

Number of parts: 1

Noise:          0%                   | 15%                  | 30%
Fpsf len:       3      9      15     | 3      9      15     | 3      9      15

l)  Avg:        2.75   2.75   2.75   | ?.31   1.52   1.29   | 1.66   0.35   0.50
    StDev:      0.00   0.00   0.00   | 0.50   0.65   0.77   | 0.46   0.56   0.56
    N:          10     10     10     | 10     10     10     | 10     10     10
u)  Avg:        2.75   2.75   2.75   | 2.30   1.6?   1.61   | 1.75   1.01   0.6?
    StDev:      0.00   0.00   0.00   | 0.46   0.32   0.35   | 0.26   0.33   0.51
    N:          10     10     10     | 10     10     10     | 10     10     10
p)  Avg:        2.75   2.75   3.03   | 2.47   1.76   1.4?   | 1.66   0.65   0.69
    StDev:      0.00   0.00   0.42   | 0.56   0.45   0.65   | 0.52   0.52   0.62
    N:          100    100    100    | 100    100    100    | 100    100    100
r)  Avg:        2.37   2.22   2.35   | 1.5?   1.20   0.80   | 1.11   0.55   0.65
    StDev:      0.61   0.47   0.52   | 0.51   0.48   0.56   | 0.61   0.51   0.53
    N:          100    100    100    | 100    100    100    | 100    100    100
lp) Avg:        ?.75   2.75   3.01   | 2.44   1.67   1.06   | 1.?7   0.55   0.72
    StDev:      0.00   0.00   0.41   | 0.55   0.65   0.64   | 0.55   0.66   0.6?
    N:          100    100    100    | 100    100    100    | 100    100    100

Number of parts: 2

Noise:          0%                   | 15%                  | 30%
Fpsf len:       3      9      15     | 3      9      15     | 3      9      15

l)  Avg:        3.14   3.14   3.14   | 2.75   2.53   2.3?   | 2.21   2.1?   1.66
    StDev:      0.00   0.00   0.00   | 0.30   0.32   0.46   | 0.32   0.46   0.46
    N:          10     10     10     | 10     10     10     | 10     10     10
u)  Avg:        3.14   3.14   3.14   | 2.70   2.53   2.51   | 2.36   1.66   1.67
    StDev:      0.00   0.00   0.00   | 0.25   0.2?   0.40   | 0.42   0.31   0.64
    N:          10     10     10     | 10     10     10     | 10     10     10
p)  Avg:        3.??   3.14   3.1?   | 2.61   2.65   2.40   | 2.47   2.03   1.64
    StDev:      0.00   0.00   0.17   | 0.35   0.46   0.56   | 0.56   0.47   0.57
    N:          100    100    100    | 100    100    100    | 100    100    100
r)  Avg:        2.55   2.63   2.16   | ?.31   2.15   1.65   | 1.62   1.43   1.17
    StDev:      0.40   0.42   0.49   | 0.4?   0.50   0.51   | 0.32   0.56   0.5?
    N:          100    100    100    | 100    100    100    | 100    100    100
lp) Avg:        3.14   3.14   3.22   | 2.60   2.62   2.56   | 2.3?   2.15   1.74
    StDev:      0.00   0.00   0.21   | 0.36   0.55   0.68   | 0.46   0.25   0.55
    N:          100    100    100    | 100    100    100    | 100    100    100

Number of parts: 3

Noise:          0%                   | 15%                  | 30%
Fpsf len:       3      9      15     | 3      9      15     | 3      9      15

l)  Avg:        3.30   3.30   3.30   | 2.04   2.75   2.66   | 2.86   2.66   2.35
    StDev:      0.00   0.00   0.00   | 0.1?   0.20   0.46   | 0.31   0.37   0.46
    N:          10     10     10     | 10     10     10     | 10     10     10
u)  Avg:        3.30   3.30   3.30   | 2.91   2.66   2.67   | 2.74   2.43   1.12
    StDev:      0.00   0.00   0.00   | 0.16   0.30   0.47   | 0.26   0.17   1.15
    N:          10     10     10     | 10     10     10     | 10     10     10
p)  Avg:        3.30   3.31   3.42   | 2.65   2.77   2.71   | ?.66   2.67   2.27
    StDev:      0.00   0.05   0.16   | 0.21   0.34   0.42   | 0.34   0.36   0.4?
    N:          100    100    100    | 100    100    100    | 100    100    100
r)  Avg:        2.?3   2.64   2.16   | 2.54   2.27   2.26   | 2.29   2.10   1.77
    StDev:      0.40   0.37   0.44   | 0.43   0.46   0.54   | 0.47   0.46   0.65
    N:          100    100    100    | 100    100    100    | 100    100    100
lp) Avg:        3.??   3.30   3.46   | 2.95   2.?1   2.67   | 2.94   2.60   2.35
    StDev:      0.00   0.00   0.29   | 0.23   0.49   0.62   | 0.42   0.46   0.30
    N:          100    100    100    | 100    100    100    | 100    100    100

Tables 1, 2, 3, 4: (Only Table 2 is included in DARPA-IU
proceedings). Table rows are scanning methods: l)
Left to Right within Top to Bottom, u) Top to Bottom
within Left to Right, p) Random without replacement,
r) Random with replacement, lp) Left to Right within
permuted scanlines. Table entries are mean and
standard deviation of peak/background ratio (eq. 1).

[Figure 4 panel titles: "Cache length: 32. Scan: Random. Noise: 15%." and "Cache length: 64. Scan: Random."]

Figure 4(a-p): (Only (e) and (f) are included in DARPA-
IU proceedings). Histograms of peak/background
ratios from a subset of cases in Tables 1-4.

Figure 5(a-d): (Only (a) and (b) are included in DARPA-
IU proceedings). Comparison of two Flush algorithms.
Histograms as in Fig. 4.

This paper introduces a contour-based
approach to motion estimation. It is based on
first computing motion at image corners, and then
propagating the corner motion estimates along the
principal contours in the image based on a local
2-D motion assumption. The results of several
experiments are presented.

1. Introduction

The earliest problem that arises in the analysis
of time-varying images is the detection of
moving image elements (edges, regions) and the
computation of the image velocity (optic flow) of
those elements. A variety of computational
schemes have been proposed to solve this problem.
In a recent survey, Ullman [1] broadly classifies
these as intensity-based and token-matching
schemes.

An important class of intensity-based schemes
takes advantage of the relationship between the
temporal and spatial gradient of any continuous
and differentiable image property which is invariant
to small changes in perspective. For example,
if we assume that the intensity, I, satisfies
these properties, the relationship

$$ -I_t = u I_x + v I_y \qquad (1.1) $$

can be used to determine velocity. Here I_t is the
temporal intensity gradient, I_x and I_y the x and
y components of the spatial intensity gradient,
and u and v the x and y components of image velocity.
Measuring I_t, I_x, I_y from an image sequence
establishes a linear constraint on the x and y
velocity components. A single velocity estimate
can be computed by spatially combining the constraints
using, e.g., Hough transforms [2], least-squares
methods [3] or minimization techniques [4].
All of these techniques suffer from certain
disadvantages. The Hough-transform and minimization
techniques assume that image velocity is uniform
over large parts of the image, and the least-squares
method further assumes that the constraint
equations determined for nearby points are
independent - an assumption that is violated by the
spatial integration required to compute spatial
derivatives.

In this paper we develop a contour-based
approach to motion estimation. The first step
estimates motion at a small set of
image points at which it is possible, in principle,
to unambiguously determine image velocity. Specifically,
corners have the property that their motion
can be directly computed based only on measurements
made at the corner (in practice, of course, one
must examine a small neighborhood of the corner).
Another important property of corners is that they
can be safely regarded as projections of scene
features whose general appearance is invariant to
rigid motion - e.g., an image corner may be the
projection of the vertex of a polyhedron, or of a
curvature discontinuity on the boundary of a surface
marking. The second step involves propagating
these velocity estimates to a larger number of
picture points. This is based on the assumption
that the image motion is locally a rigid two-dimensional
motion. Given this assumption, the velocity
at one point on a contour and the normal component
at a neighboring point can be combined to compute
the actual velocity at the neighboring point.

2. Estimating motion at corners

The motion of a corner can be computed in a
variety of ways. Section 2.1 describes an approach
based on temporal intensity changes along lines
parallel to the sides of the corner. Section 2.2
discusses a second technique which combines normal
vectors in a small neighborhood of the corner along
the contour. It is similar to the velocity estimation
algorithm in Horn and Schunck [4]. We should
also point out that Nagel [5] has recently proposed
a corner velocity estimation algorithm based on a
differential approach (i.e., a Taylor series expansion
of the image function in the neighborhood
of the corner truncated after the second-order
terms).

2.1  A  structural  approach 

This subsection presents a structural approach
to corner motion estimation. We first describe
velocity computation for the case of translational
motion, and then consider translation combined with
rotation.

Suppose that a corner simply translates from
point C_0 to C_1 between two frames t_0 and t_1 (see
Figure 1). Let O' be a point on the bisector of
PC_0R and let O'A and O'B be lines parallel to C_0P and C_0R,
respectively, at some unit distance from C_0P and
C_0R. Suppose that |O'A| = |O'B| = 1+m, for some
constant m. Finally, assume that the intensity
inside the corner is 1 and outside the corner is 0.

Now, at time t_0, the average intensity along
line segments O'A and O'B is

$$ I_{O'A}(t_0) = I_{O'B}(t_0) = \frac{1}{1+m}. $$

If Δx' and Δy' are the components of the
translation in the directions of the lines O'A and
O'B, then

$$ I_{O'A}(t_1) = \frac{1+\Delta x'}{1+m}, \qquad I_{O'B}(t_1) = \frac{1+\Delta y'}{1+m}, $$

assuming that m is chosen large enough so that
max(Δx', Δy') < m. Finally, Δx' and Δy' can be
computed from

$$ \Delta I_{O'A} = I_{O'A}(t_1) - I_{O'A}(t_0) = \frac{\Delta x'}{1+m}, $$

$$ \Delta I_{O'B} = I_{O'B}(t_1) - I_{O'B}(t_0) = \frac{\Delta y'}{1+m}. $$

Once Δx' and Δy' are computed, the components
of the velocity in the original image coordinate
system can be recovered easily.

The practical success of this technique depends
on our ability to compute several corner
parameters accurately. These parameters are

1. corner location at t_0,

2. corner shape (angles a and d), and

3. corner contrast (assumed here to be 1).

The computation of these parameters is discussed
in Section 4.1. Next, we extend the previous
simple analysis to include rotation as well as
translation. We will treat this case as a translation
from C_0 to C_1 followed by a rotation about
C_1 through a clockwise angle γ (see Figure 2).
Since translation and rotation are specified by a
total of three parameters, we could extend the
above analysis using only a third line segment
parallel to either O'A or O'B. Instead, we consider
two pairs of parallel line segments, and
compute the displacements in the directions O'A
and O'B rather than directly computing the angle γ.

Let Δx'_t, Δy'_t be the translational components
of the motion in the O'A and O'B directions, and
Δx'_r and Δy'_r the corresponding rotational
components. Then

$$ \Delta x'_t + \Delta x'_r = (1+m)\,\Delta I_{O'A} \qquad (2.1) $$

$$ \Delta y'_t - \Delta y'_r = (1+m)\,\Delta I_{O'B} \qquad (2.2) $$

From Figure 3, we see that

$$ \frac{\Delta x'_r}{1+\Delta y'_t} = \frac{\Delta x'_r - \Delta x''_r}{\delta} \qquad (2.3) $$

where δ is the distance between the parallel line
segments O'A and CD. Similarly

$$ \frac{\Delta y'_r}{1+\Delta x'_t} = \frac{\Delta y'_r - \Delta y''_r}{\delta} \qquad (2.4) $$

Also

$$ \Delta x'_r - \Delta x''_r = (1+m)\,[\,I_{O'A}(t_1) - I_{CD}(t_1)\,] \qquad (2.5) $$

$$ \Delta y'_r - \Delta y''_r = (1+m)\,[\,I_{O'B}(t_1) - I_{\parallel O'B}(t_1)\,] \qquad (2.6) $$

where I_{∥O'B} denotes the average intensity along the
second line segment parallel to O'B.

Substituting (2.7) and (2.8) into (2.9) and
(2.10), we can also compute Δx'_r and Δy'_r, which
gives us a complete description of the motion of
the corner.


2.2 A least-squares approach


In this section we show how simple least-squares
algorithms can be used to compute corner
motion. We can rewrite (1.1) to obtain

$$ V_n = -I_t / |\nabla I| \qquad (2.11) $$

where |∇I| is the magnitude of the spatial intensity
gradient at that point (|∇I| = \sqrt{I_x^2 + I_y^2}) and V_n is
the projection of the velocity vector V onto
the intensity gradient at that point. We will
first consider the case when the velocity is only
a translation, and then consider translation with
rotation.


If we assume that the velocities are constant
in a small neighborhood of the corner along the
contour, then we can relate the problem of determining
the velocity at the corner to that of
determining a V to minimize

$$ E^2 = \sum_{i=1}^{k} (V \cdot n_i - V_{ni})^2 \qquad (2.12) $$

where n_i = (n_{i1}, n_{i2}) is the unit normal vector of the
ith contour point and V_{ni} is the projection of V
onto the intensity gradient direction at the ith contour
point, called the normal projection for short.

By minimizing the error E^2, we obtain


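$$ a u + c v = d, \qquad c u + b v = e \qquad (2.13) $$

where V = (u, v) and

$$ a = \sum_{i=1}^{k} n_{i1}^2, \quad b = \sum_{i=1}^{k} n_{i2}^2, \quad c = \sum_{i=1}^{k} n_{i1} n_{i2}, \quad d = \sum_{i=1}^{k} n_{i1} V_{ni}, \quad e = \sum_{i=1}^{k} n_{i2} V_{ni}. $$
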


From equation (2.13), we have the velocity
estimate of a turning point

$$ u = \frac{bd - ce}{ab - c^2}, \qquad v = \frac{ae - cd}{ab - c^2}. \qquad (2.14) $$
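
The translation-only estimate of eqs. (2.12)-(2.14) can be sketched as follows. The function name `corner_velocity` and the tolerance used to reject near-collinear normals are assumptions made for this illustration.

```python
import math


def corner_velocity(normals, normal_speeds, eps=1e-9):
    """Solve a*u + c*v = d, c*u + b*v = e for the corner velocity (u, v).

    normals: unit normals (n_i1, n_i2) at contour points near the corner.
    normal_speeds: the measured normal projections V_ni at those points.
    """
    a = sum(n1 * n1 for n1, _ in normals)
    b = sum(n2 * n2 for _, n2 in normals)
    c = sum(n1 * n2 for n1, n2 in normals)
    d = sum(n1 * vn for (n1, _), vn in zip(normals, normal_speeds))
    e = sum(n2 * vn for (_, n2), vn in zip(normals, normal_speeds))
    det = a * b - c * c
    if abs(det) < eps:
        # All normals parallel (a straight line): motion along it is invisible.
        raise ValueError("normal directions do not vary; velocity is ambiguous")
    return (b * d - c * e) / det, (a * e - c * d) / det


# Synthetic corner: two sides whose normals point at 0 and 60 degrees.
side1 = (1.0, 0.0)
side2 = (math.cos(math.radians(60)), math.sin(math.radians(60)))
normals = [side1] * 3 + [side2] * 3
true_v = (2.0, -1.0)
speeds = [true_v[0] * n1 + true_v[1] * n2 for n1, n2 in normals]
u, v = corner_velocity(normals, speeds)
```

With noise-free normal projections the estimate reproduces the true velocity; with parallel normals the determinant ab - c^2 vanishes and the function refuses to answer.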


Notice that the solutions for u and v are only
meaningful when ab - c^2, which is related to the
variance of normal directions, is high. For a
straight line segment, e.g., there is no solution
because the denominators of eqs. (2.14) are zero,

$$ ab - c^2 = \sum_{i=1}^{k} n_{i1}^2 \cdot \sum_{i=1}^{k} n_{i2}^2 - \Big( \sum_{i=1}^{k} n_{i1} n_{i2} \Big)^2 = k n_1^2 \cdot k n_2^2 - (k n_1 n_2)^2 = 0, $$

i.e., from a small element of a straight line, the
only information that one can obtain is the motion
component normal to that line, and motion along
this line element cannot be detected. Corners,
however, are just those points where the variance
of normal directions is locally maximal.

Now consider the case in which the motion of
the contour can be decomposed into a translation
with velocity V_0 = (V_{0x}, V_{0y}) at the corner (x_0, y_0)
and a rotation around (x_0, y_0) with the angular
velocity ω, as shown in Figure 4. Note that since
we are only interested in the motion of the corner,
we do not explicitly solve for ω, although it would
be easy to do so.

We obtain

$$ \omega \times d_i = V_i - V_0. \qquad (2.15) $$

By rewriting this equation, we have

$$ V_i = \begin{pmatrix} V_{0x} - \omega d_{iy} \\ V_{0y} + \omega d_{ix} \end{pmatrix} \qquad (2.16) $$

where d_{ix}, d_{iy} are the components of the displacement
vector d_i from point (x_0, y_0) to point (x_i, y_i)
on the contour. The normal component of V_i is
related to d_i, ω and V_0 by

$$ V_{ni} = (V_{0x} - \omega d_{iy})\, n_{ix} + (V_{0y} + \omega d_{ix})\, n_{iy}. \qquad (2.17) $$

By considering three points on the contour, V_{0x}
and V_{0y} can be simply obtained by solving linear
equations. In general, more than three image
points are taken and V_{0x}, V_{0y} are computed by
minimizing the following square error:

$$ E^2 = \sum_i \big( V_{0x} n_{ix} + V_{0y} n_{iy} + \omega (d_{ix} n_{iy} - d_{iy} n_{ix}) - V_{ni} \big)^2 $$

We obtain the following least squares solution
of V_{0x}, V_{0y}:

$$ V_{0x} = \frac{1}{\delta} \begin{vmatrix} C_1 & S_{11} & a_1 \\ C_2 & S_{02} & a_2 \\ C_3 & S_{01} & a_3 \end{vmatrix}, \qquad V_{0y} = \frac{1}{\delta} \begin{vmatrix} S_{20} & C_1 & a_1 \\ S_{11} & C_2 & a_2 \\ S_{10} & C_3 & a_3 \end{vmatrix} $$

where δ is the determinant of the coefficient matrix,

$$ \delta = \begin{vmatrix} S_{20} & S_{11} & a_1 \\ S_{11} & S_{02} & a_2 \\ S_{10} & S_{01} & a_3 \end{vmatrix}, $$

and

$$ S_{rpst} = \sum_i n_{ix}^r\, n_{iy}^p\, d_{ix}^s\, d_{iy}^t, \qquad S_{rp} = \sum_i n_{ix}^r\, n_{iy}^p, $$

$$ a_1 = S_{1110} - S_{2001}, \quad a_2 = S_{0210} - S_{1101}, \quad a_3 = S_{0110} - S_{1001}, $$

$$ C_1 = \sum_i V_{ni}\, n_{ix}, \quad C_2 = \sum_i V_{ni}\, n_{iy}, \quad C_3 = \sum_i V_{ni}. $$

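
The translation-plus-rotation fit can be sketched numerically. Note that this sketch swaps in a different solver: rather than evaluating the determinant formula above, it forms the ordinary normal equations of the square error for the three unknowns (V_0x, V_0y, ω) and solves them by Cramer's rule. The function names and the synthetic data are assumptions made for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def corner_motion(points, normals, normal_speeds, corner, eps=1e-9):
    """Least-squares (V0x, V0y, omega) from normal speeds along a contour."""
    rows = []
    for (x, y), (nx, ny) in zip(points, normals):
        dx, dy = x - corner[0], y - corner[1]
        # Coefficients of V0x, V0y and omega in the normal component (2.17).
        rows.append((nx, ny, dx * ny - dy * nx))
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * vn for r, vn in zip(rows, normal_speeds)) for i in range(3)]
    D = det3(A)
    if abs(D) < eps:
        raise ValueError("contour points do not constrain the motion")
    solution = []
    for k in range(3):
        Ak = [row[:] for row in A]
        for i in range(3):
            Ak[i][k] = rhs[i]  # Cramer's rule: replace column k by the rhs
        solution.append(det3(Ak) / D)
    return tuple(solution)


# Synthetic corner at the origin with sides along the x and y axes.
pts = [(1, 0), (2, 0), (3, 0), (0, 1), (0, 2), (0, 3)]
nrm = [(0.0, 1.0)] * 3 + [(1.0, 0.0)] * 3
v0x, v0y, w = 1.5, -0.5, 0.1
spd = [(v0x - w * y) * nx + (v0y + w * x) * ny
       for (x, y), (nx, ny) in zip(pts, nrm)]
est = corner_motion(pts, nrm, spd, corner=(0, 0))
```

On noise-free data generated from eq. (2.16) the three parameters are recovered; only the first two are needed for the corner velocity.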
3. Propagation of velocity vectors along image contours

3.1 The local constraint and the propagation formula

Suppose the velocity vectors V_0, V_k at the ends
of a contour A_0A_k are known (see Figure 4). Consider
a small line segment dS along the contour
A_0A_1. Assuming that the motion is a rigid motion,
the component V_{0S} of V_0, the motion of A_0, parallel to A_0A_1 must
equal the parallel component V_{1S} of the velocity
V_1 at A_1:

$$ V_{0S} = V_{1S} \qquad (3.1a) $$

or

$$ V_0 \cdot \hat{dS} = V_1 \cdot \hat{dS}, \qquad (3.1b) $$

where V_0 and V_1 are the velocity vectors at the
two ends of the line dS, and \hat{dS} is the unit vector
along dS, the vector joining A_0 to A_1. Rewriting
this local constraint (eq. 3.1b) in component
form, we obtain

$$ V_0 \cdot \hat{dS} = (V_{1n}\, n + V_{1t}\, t) \cdot \hat{dS} = V_{1n}\, n \cdot \hat{dS} + V_{1t}\, t \cdot \hat{dS} \qquad (3.2) $$

where V_{1n} and V_{1t} are the normal component and the
tangential component of the velocity vector V_1
respectively, and n and t are the unit vectors in
the normal and tangent directions of the contour
at A_1. From Figure 5, we see that

$$ n \cdot \hat{dS} = \cos\alpha \quad \text{and} \quad t \cdot \hat{dS} = \cos(\pi/2 - \alpha) = \sin\alpha $$

so that

$$ V_{0S} = V_{1t} \sin\alpha + V_{1n} \cos\alpha. \qquad (3.3) $$

Thus, we have that the tangential component is

$$ V_{1t} = (V_{0S} - V_{1n} \cos\alpha) / \sin\alpha \qquad (3.4) $$

where α is the angle between the unit vector \hat{dS}
and the normal vector n at the point A_1. We also
have γ = θ - α, where θ is the angle between the x-axis
and the normal vector n, and γ is the angle between
the x-axis and the line segment dS.

We can propagate the velocity along a contour
using eq. (3.4), because the first projection V_{0S}
is known after the previous propagation and the
normal component V_{1n} can be computed by, e.g., the
methods discussed in [3] or [4]. Once V_{1t} is computed,
V_1 can be obtained because

$$ |V_1| = \sqrt{V_{1n}^2 + V_{1t}^2} \qquad (3.5) $$

and its direction makes the angle θ - arctan(V_{1t}/V_{1n})
with the x-axis.
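
One propagation step based on eq. (3.4) can be sketched as follows. The function name `propagate_step` and the choice of tangent direction t = (-n_y, n_x) are assumptions; the sketch works with the dot products n·dS and t·dS directly, which play the roles of cos α and sin α.

```python
import math


def propagate_step(v0, v1n, normal, ds, eps=1e-9):
    """Recover the full velocity V1 at A1 from V0 at A0 and the normal speed at A1.

    v0: velocity (x, y) at A0; v1n: normal component of the velocity at A1;
    normal: unit contour normal at A1; ds: vector from A0 to A1.
    """
    nx, ny = normal
    tx, ty = -ny, nx                      # unit tangent, perpendicular to n
    length = math.hypot(ds[0], ds[1])
    sx, sy = ds[0] / length, ds[1] / length
    v0s = v0[0] * sx + v0[1] * sy         # projection of V0 on the chord
    cos_a = nx * sx + ny * sy             # n . dS
    sin_a = tx * sx + ty * sy             # t . dS
    if abs(sin_a) < eps:
        raise ValueError("chord is parallel to the normal; V1t is undetermined")
    v1t = (v0s - v1n * cos_a) / sin_a     # eq. (3.4)
    return (v1n * nx + v1t * tx, v1n * ny + v1t * ty)


# Rigid rotation about the origin with angular velocity 0.3.
p0, p1 = (1.0, 0.0), (1.2, 0.5)
v_at = lambda p: (-0.3 * p[1], 0.3 * p[0])
n1 = (0.6, 0.8)
v1_true = v_at(p1)
v1n = v1_true[0] * n1[0] + v1_true[1] * n1[1]
v1 = propagate_step(v_at(p0), v1n, n1, (p1[0] - p0[0], p1[1] - p0[1]))
```

For an exactly rigid two-dimensional motion the chord components of the two velocities agree, so the step recovers the true velocity at A_1; the step fails when the chord is parallel to the normal, which corresponds to sin α = 0 in eq. (3.4).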


3.2  Error  analysis  and  a  correction  technique 

From eq. (3.4) the new estimate of the tangent
component V_{1t} is based on the previous projection
V_{0S} and on the normal component V_{1n} at the current
propagation point. Differentiating this equation
we obtain

$$ dV_{1t} = \frac{\partial V_{1t}}{\partial V_{0S}}\, dV_{0S} + \frac{\partial V_{1t}}{\partial V_{1n}}\, dV_{1n} + \frac{\partial V_{1t}}{\partial \alpha}\, d\alpha = \frac{1}{\sin\alpha}\, dV_{0S} - \frac{\cos\alpha}{\sin\alpha}\, dV_{1n} + \frac{V_{1n} - V_{0S}\cos\alpha}{\sin^2\alpha}\, d\alpha \qquad (3.6) $$

Note that the error in V_{1t} depends on the error in
the previous projection (dV_{0S}), the error in the
normal component V_{1n} at the current propagation
point (dV_{1n}), and the error in the measurement of
the angle α (dα).

The result of these various errors is that
when the propagation reaches A_k, the velocity vector
V'_k attributed to A_k by the propagation procedure
will differ from the velocity vector V_k originally
computed at A_k. Therefore, at the point A_k we
compute the error ΔV_k between the propagated velocity
estimate V'_k and the original velocity vector V_k.

If this error is less than some tolerance, then
the propagation procedure is stopped at point A_k;
otherwise a correction procedure is applied. If
we consider the error ΔV_k as having been accumulated
in the previous k steps, then the average velocity
error in one step is

$$ V_e = \Delta V_k / k $$

so we have m V_e as the velocity error at the mth
step, and we propagate this velocity error step by
step backward to correct the estimated velocity
vector at each point along the same contour.
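
The backward correction can be sketched as below. Two assumptions are made for this illustration: the error is taken as the propagated estimate minus the originally computed velocity, and the correction subtracts m V_e from the estimate at step m. The function name and the `tolerance` parameter are hypothetical.

```python
import math


def backward_correct(estimates, original_end, tolerance=0.0):
    """Spread the end-point error linearly back along the contour.

    estimates[m] is the propagated velocity (x, y) after m steps, m = 0..k;
    original_end is the velocity computed independently at the far corner.
    """
    k = len(estimates) - 1
    if k < 1:
        return list(estimates)
    err_x = estimates[k][0] - original_end[0]
    err_y = estimates[k][1] - original_end[1]
    if math.hypot(err_x, err_y) <= tolerance:
        return list(estimates)            # error acceptable: no correction
    step_x, step_y = err_x / k, err_y / k  # average error per step, V_e
    return [(vx - m * step_x, vy - m * step_y)
            for m, (vx, vy) in enumerate(estimates)]


# Propagated estimates that drift linearly away from a true velocity (1, 1).
drifting = [(1 + 0.1 * m, 1 - 0.05 * m) for m in range(5)]
corrected = backward_correct(drifting, (1.0, 1.0))
```

A linearly accumulated drift is removed exactly by this scheme; errors that accumulate unevenly along the contour are only partially corrected.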

4. Experimental results

We applied the corner motion estimation and
velocity propagation algorithms to two sets of
motion pictures. Section 4.1 describes the corner
motion estimation results, and Section 4.2 describes
the propagation results.

4.1 Corner motion estimation

The three corner motion models described in
Section 2 were applied to two image sequences
containing two frames each (Figures 6-7).

We first describe the application of
the structural model presented in Section 2.1.
Corners are "provisionally" detected using the
corner detection algorithm described in Kitchen
and Rosenfeld [6]. Next, a small window around
each corner is analyzed to obtain a more accurate
description of the corner. Based on the
assumption that the corner locally contrasts
with its surround, a local thresholding
procedure (Milgram [7]) is used to segment the
window. The corner is then relocated to a maximum
curvature boundary point in the thresholded window.
The slopes of the line segments meeting at the corner
are computed using a one-dimensional (slope)
Hough transform procedure (only slope need be computed
since the lines are constrained to pass
through the corner point). The corners detected
by this procedure are marked with dark crosses
in Figures 6a and 7a.

To overcome the effects of various sources of
error on the motion estimation, several quadruples
of line segments are used to compute estimates of
$\Delta x_t$, $\Delta y_t$, $\Delta x_r$, and $\Delta y_r$, with the final motion
estimate taken as the average.

The  results  for  the  airplane  in  Figure  6  are 
displayed in Table 1. The estimated motion vectors
were obtained by the authors' examination of
digital  enlargements  of  the  images.  No  useful 
results  were  obtained  for  the  moving  car  in 
Figure  7.  There  are  several  reasons  for  this: 

1.  The  grey  level  corners  in  the  car  are 
much  more  rounded  than  the  airplane’s, 
and  the  motion  estimates  are  sensitive 
to  the  corner  location;  and 

2.  The  spatial  resolution  of  Figure  7 
is  not  high  enough  to  allow  us  to 
place  a  sufficiently  large  window 
around  a  corner  for  segmentation 
which  does  not  contain  some  other 
image  feature. 


The  least-squares  corner  motion  estimates 
presented  in  Section  2.2  require  that  we  first 
compute  the  normal  component  of  motion  along  the 
contour  in  the  neighborhood  of  the  corner.  The 
magnitude  of  the  normal  component  and  the  compon¬ 
ents  of  the  unit  normal  vector  on  the  x  and  y  axes 
are 
The unit of length is the grid spacing interval in
each  image  frame  and  the  unit  of  time  is  the  image 
frame  sampling  period. 


Tables 2 and 3 contain the results of applying
both least-squares corner estimators to the
images in Figures 6 and 7, respectively.

4.2 Velocity propagation

We  applied  the  propagation  technique  to  the 
two  image  sequences  displayed  in  Figures  6  and  7. 

The  propagation  technique  was  implemented  as 
follows: 

1)  Velocity  vectors  are  first  determined  at  a 
set  of  "corner"  points  in  the  first  frame 
by  the  least-square  corner  motion  estimator 
which  assumes  that  the  corner  motion  is  a 
2-D  translation. 

2)  The  velocity  vector  at  the  corner  is 
propagated  along  the  contours  that  meet  at 
the  corner  until  a  second  corner  point  is 
encountered.  The  contours  are  followed  by 
a  very  simple  maximum  gradient  technique. 

A velocity vector is not computed at every
pixel on the contour, but only at every
kth pixel, to reduce the error in a.

3) When the terminating corner point is
reached, the propagation is stopped and
the error velocity vector is computed. If
this error is greater than a preset tolerance,
then the error velocity vector is
back-propagated along the same contour.

The  results  of  the  propagation  are  displayed  in 
Figures  8-9. 

5.  Summary 

We  have  presented  a  contour-based  approach  to 
motion  estimation  based  on  first  estimating  motion 
at  image  corners  and  then  propagating  these  motion 
estimates  along  image  contours.  One  potential 
advantage  of  such  an  approach  over  others  such  as 
[3-4]  is  that  motion  information  is  not  integrated 
across  the  boundaries  of  moving  objects,  but  only 
along  such  boundaries.  Since  very  often  the  only 
reliable  source  of  motion  information  is  at  object 
boundaries  (when,  for  example,  object  interiors 
are  homogeneous)  it  is  important  that  motion  esti¬ 
mation  techniques  yield  accurate  motion  estimates 
at  boundaries. 

The  examples  presented  in  Section  4  both 
consisted  of  a  single  object  moving  across  a  homo¬ 
geneous  background.  The  propagation  technique 
presented  in  Section  3  would  need  to  be  modified 
to  be  applicable  to  more  complex  image  sequences 
containing  multiple  moving  objects  and  occlusions 
so  that  motion  information  is  not  propagated  from 
one  object  to  another. 
Fig. 4. Two-dimensional rigid motion.


Figure  8.  Velocity  field  using  the  propagation 
technique  along  the  contours  of  the 
moving  airplane  shown  in  Figure  6. 


Figure  9.  Velocity  field  using  the  propagation 
technique  along  the  contours  of  the 
moving  car  shown  in  Figure  7. 
Table 1. Motion vectors for corners in Figure 6.
Table 3. Motion vectors for turning points of Figure 7 (traffic).
The image irradiance equation constrains the relationship
between  surface  orientation  in  a  scene  and  the  irradiance  of 
its  image.  Additional  constraints,  needed  to  recover  surface 
orientation  from  image  irradiance,  usually  require  the  recovered 
surface  to  be  smooth.  We  demonstrate  that  smoothness  is  not 
sufficient  for  this  task. 

The  ‘beliefs'  a  visual  system  has  about  the  laws  of  surface 
radiance  provide  sufficient  constraints  to  relate  the  irradiance  of 
an  image  to  the  corresponding  scene's  surface  orientations.  We 
propose  a  set  of  beliefs  and  derive  surface  orientation  -  image 
irradiance  equations  for  a  visual  system  with  these  beliefs.  The 
surface  orientations,  derived  from  the  image  by  a  visual  system 
with  these  beliefs,  may  not  be  those  present  in  the  scene,  rather 
they  are  those  believed  to  be  present. 

1  INTRODUCTION 

Most previous work that has addressed the problem of
the recovery of surface shape from shading (image irradiance)
has been based on solving the image irradiance equation, which
relates the radiance of a scene to the irradiance of its image [1-8].
This formulation of the relationship between scene radiance and
image irradiance is embodied in a first-order partial differential
equation. The approaches to solving this differential equation
have generally been either direct integration along characteristic
curves [1], or an iterative algorithm that attempts to reduce
the difference between the predicted image irradiance and the
recorded value [5-7].

As the image irradiance equation is a single equation
relating the image irradiance and two independent variables
(specifying surface orientation), it does not uniquely determine
the two independent variables for a given value of image irradiance.
Consequently, when this equation is used to recover
surface shape, additional constraints are necessary. These may
be imposed by boundary conditions, by restrictions on the type
of surface to be recovered, or by a combination of the two. For
some images, when we can determine important features (such
as the fact that an edge is an occlusion boundary caused by a
surface turning smoothly away from the viewing direction), we
can use boundary conditions to constrain the solution; in large
portions of the image, however, we can say something only about
the type of surface we would like to recover. Surface smoothness
is the weakest assumption to date that still allows surface
shape to be recovered. Smoothness normally signifies that the
surface is continuous and that it is once or twice differentiable.
Smoothness has been required to play the role of propagator of
boundary conditions and selector of the surface to be recovered.
Is smoothness capable of these tasks?

Not  all  authors  have  used  smoothness  as  their  additional 
constraint.  Pentland’s  additional  constraints  are  concerned 
with  surface  shape  [8].  He  assumes  that  locally  the  surface  has 
equal  curvature  along  orthogonal  directions.  This  is  an  assump¬ 
tion  that  is  strong  enough  to  allow  but  a  single  interpretation 
for  the  surface  orientation  and,  at  the  same  time,  is  one  that 
enables  recovery  of  the  surface  orientation  by  purely  local  com¬ 
putation. 

In the new formulation presented here, we attempt to avoid
assumptions about the particular form of R(p,q) or of the surface
shape. Rather, we define a class of surface radiance functions
that includes those functional forms generally used for
surface radiance, and for this class we derive equations relating
surface orientation to image irradiance.

The  class  of  surface  radiance  functions  is  defined  by  two 
properties of the class. One may view these properties as axioms
that  a  visual  system  uses  to  define  its  beliefs  about  the  physical 
nature  of  the  world.  When  our  formulation  is  viewed  in  this 
manner,  our  assumptions  are  about  the  properties  of  reflection 
in  the  world  rather  than  about  specific  reflection  functions  or 
specific  surface  shapes. 

The presentation we adopt in this paper is to describe the
various formulations that have been used to employ smoothness,
including a relaxation procedure of our own that resembles
its counterpart in engineering, and then to derive general
surface orientation - image irradiance equations. We look at some
restricted situations that may prove useful in using this formulation
to recover surface shape.


2  ITERATIVE  FORMULATIONS  FOR 
SURFACE  RECOVERY 

The image irradiance equation as presented by Horn [2] is

$I(x,y) = R(p,q)$,

where $I(x,y)$ is the image irradiance as a function of the image
coordinates x and y, and R(p,q) is the surface radiance as a
function of p and q, the derivatives of depth with respect to the
image coordinates. In the 'shape from shading' approach it is
generally assumed that R(p,q) is known for all p and q (that is,
the reflectance map is specified). The iterative approach applies
this equation on a pixel-by-pixel basis, that is, for pixel (i,j),

$I_{i,j} = R(p_{i,j}, q_{i,j})$,

where $I_{i,j}$ is the image irradiance for the (i,j)th pixel, and $p_{i,j}$, $q_{i,j}$
is the surface orientation of the surface patch that is imaged at
pixel (i,j). For convenience we use the notation

$R_{i,j} = R(p_{i,j}, q_{i,j})$.

If, at some stage of the iterative procedure, we have assigned
particular $p_{i,j}$, $q_{i,j}$ as the surface orientation of the
(i,j)th pixel, then the residual expression

$\epsilon^{I}_{i,j} = (I_{i,j} - R_{i,j})^2$

specifies the error caused by our assignment of surface orientation.
If this were our only constraint we could select $p_{i,j}$, $q_{i,j}$
so that $\epsilon^{I}_{i,j} = 0$. This would guarantee that the image irradiance
equation is satisfied pixel by pixel, but because there
are infinitely many solutions, we need further constraints to
reduce the number of possible solutions.

Smoothness is usually introduced by specifying a relationship
that we would like to have hold between the surface orientation
of the (i,j)th pixel and its neighbors. The various iterative
approaches [5-7] differ in the way this relationship is specified.
Of course, at a particular stage of the iterative process this
relationship between neighboring pixels will not be exact. Once
again we can specify a residual equation for the error in the
smoothness relation,

$\epsilon^{S}_{i,j} = [f(p_{i,j}, q_{i,j}, p_{i-1,j}, q_{i-1,j}, p_{i+1,j}, q_{i+1,j}, p_{i,j-1}, q_{i,j-1}, p_{i,j+1}, q_{i,j+1}, \ldots)]^2$,

where f is the relationship between the surface orientation at
(i,j) and its neighbors. An example of the type of relationship is
the difference between the surface orientation of pixel (i,j) and
the mean value of the surface orientations of its 4-neighbors.

We have two constraints that need to be satisfied simultaneously,
one from image irradiance and one from surface
smoothness. At each stage of the iterative process, the total
residual error for pixel (i,j) can be described by

$\epsilon_{i,j} = \lambda \epsilon^{I}_{i,j} + \epsilon^{S}_{i,j}$,

where $\lambda$ is a weighting factor that can adjust¹ the influence of
the error in image irradiance relative to the error in smoothness. For
the image, the total residual error is

$\epsilon = \sum_{i,j} \epsilon_{i,j}$.

The allocation of surface orientations to all pixels should minimise
this total error.

¹ Since the error in image irradiance is not necessarily commensurate with
that in surface smoothness, some form of normalisation is required.

Differentiating $\epsilon$ with respect to $p_{i,j}$ and also with respect to
$q_{i,j}$ gives two equations for each pixel in the image. While complicated
forms of the relationship between $p_{i,j}$, $q_{i,j}$ and their
neighboring pixels will generally occur, we choose our smoothness
relation so that we can arrange the equations in the form

$p_{i,j} = F_1(p_{i,j}, q_{i,j}$, and p's and q's of neighboring pixels$)$,

$q_{i,j} = F_2(p_{i,j}, q_{i,j}$, and p's and q's of neighboring pixels$)$,

where $F_1$ and $F_2$ are functions.

We  therefore  have  an  iterative  scheme  that,  given  some 
initial  solution,  we  improve  by  reducing  the  residual  error  in 
image  irradiance  and  surface  smoothness.  We  need  to  ask  the 
following  questions  of  such  a  scheme:  Under  what  conditions 
will  it  converge  to  a  solution?  Is  that  solution  unique?  Can 
boundary  conditions  be  used  as  well?  Does  smoothness,  as 
defined  by  our  relation,  give  us  the  type  of  surface  we  want? 

3  SURFACE  ORIENTATION 

There are many equivalent parameterizations of surface
orientation. Mentioned previously were the parameters p and
q, the derivatives of depth with respect to image coordinates.
Some authors prefer to specify surface orientation using slant
and tilt. Slant is the angle between the surface normal and the
viewing direction, while tilt is the angle between the image x
axis and the projection of the surface normal onto the image
plane. Other parameterizations [7] have been used when particular
properties of the parameterization are to be exploited.
The parameters we use are l and m:

$l = \sin\sigma \cos\tau$,
$m = \sin\sigma \sin\tau$,

where $\sigma$ is the surface slant and $\tau$ its tilt. l is the component
of the surface normal in the direction of the x axis and m is
the component in the y direction. We select this particular
parameterization, as l and m are bounded:

$0 \le l^2 + m^2 \le 1$.

For surfaces that we can see,

$0 \le \sigma \le \pi/2$,

$0 \le \tau < 2\pi$.

Consequently l and m specify the surface normal of an imaged
surface without ambiguity.

Sometimes it is useful to think of the orientation
parameterization as being obtained by mapping points on the
Gaussian sphere onto a transformation plane. The l and m space
is the disc obtained by orthographically projecting onto an
equatorial plane the points on the hemisphere of the Gaussian
sphere representing those surfaces that can be seen.
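
The mapping between slant and tilt and the parameters l and m can be sketched in a few lines of Python. The function names are hypothetical, angles are assumed to be in radians, and tilt is assumed to be measured from the image x axis as described above.

```python
import math

def lm_from_slant_tilt(slant, tilt):
    """Return (l, m) = (sin(slant) cos(tilt), sin(slant) sin(tilt))."""
    return math.sin(slant) * math.cos(tilt), math.sin(slant) * math.sin(tilt)

def slant_tilt_from_lm(l, m):
    """Invert the mapping for a visible surface (slant in [0, pi/2])."""
    r = math.hypot(l, m)  # equals sin(slant)
    if r > 1.0:
        raise ValueError("l^2 + m^2 must not exceed 1")
    slant = math.asin(r)
    tilt = math.atan2(m, l) % (2.0 * math.pi)  # fold into [0, 2*pi)
    return slant, tilt
```

Because slant is restricted to the visible hemisphere, the inverse is unique except at l = m = 0, where the tilt is undefined and the sketch returns 0.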


4  FORMULATIONS  USED  FOR 
SURFACE  RECOVERY 


To  explore  the  issues  of  convergence,  propagation  of  bound¬ 
ary  conditions,  and  the  type  of  surface  smoothness  promotes,  we 
formulate  the  problem  in  two  ways:  parallelling  the  technique 
previously  described  and,  alternatively,  resembling  the  use  of 
the  relaxation  method  to  solve  structural  engineering  problems. 

The function for scene radiance, used to create synthetic
images for the experiment and used by the shape recovery algorithms,
is

$R(l,m) = 0.1509\left(\frac{1 + \sqrt{1 - l^2 - m^2}}{2}\right) + \mathrm{Max}[0.4437\sqrt{1 - l^2 - m^2} + 0.3137\,l + 0.3137\,m,\ 0]$.


This  function  is  appropriate  for  a  scene  exhibiting  Lambertian 
reflectance  and  which  is  illuminated  by  both  a  collimated  source 
and  a  uniform  hemispherical  source  (atmosphere).  The  par¬ 
ticular  numerical  constants  specify  the  light  direction  and  in¬ 
tensity,  and  the  surface  albedo. 
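
As an illustration, the radiance function can be written directly in Python. This is a sketch under one stated assumption: the hemispherical term is taken to be (1 + sqrt(1 - l^2 - m^2))/2, since the denominator is not legible in the scanned source. The helper `light_direction_degrees` is hypothetical and simply normalises the three collimated-source coefficients to recover the implied source direction.

```python
import math

def radiance(l, m):
    """Scene radiance for surface normal components (l, m)."""
    n = math.sqrt(max(0.0, 1.0 - l * l - m * m))  # normal component toward viewer
    sky = 0.1509 * (1.0 + n) / 2.0                # uniform hemispherical source (assumed /2)
    sun = max(0.4437 * n + 0.3137 * l + 0.3137 * m, 0.0)  # collimated source, clipped in shadow
    return sky + sun

def light_direction_degrees():
    """Slant and tilt of the collimated source implied by the coefficients."""
    cx, cy, cz = 0.3137, 0.3137, 0.4437
    norm = math.sqrt(cx * cx + cy * cy + cz * cz)
    slant = math.degrees(math.acos(cz / norm))
    tilt = math.degrees(math.atan2(cy, cx))
    return slant, tilt
```

The coefficients correspond to a source at roughly 45 degrees slant and 45 degrees tilt, which matches the statement that the source lies to the upper right of the image.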

The first formulation is similar to that described previously;
we shall call this the 'usual' formulation. From the
image irradiance equation we have the error term

$\epsilon^{I}_{i,j} = (I_{i,j} - R_{i,j})^2$.

The smoothness constraint is the requirement that $l_{i,j}$ be
the average of its 4-neighbors, and $m_{i,j}$ be the average of its
4-neighbors. The error term for smoothness is

$\epsilon^{S}_{i,j} = \left(l_{i,j} - \frac{l_{i-1,j} + l_{i+1,j} + l_{i,j-1} + l_{i,j+1}}{4}\right)^2 + \left(m_{i,j} - \frac{m_{i-1,j} + m_{i+1,j} + m_{i,j-1} + m_{i,j+1}}{4}\right)^2$.

Note that this constraint is exact for a surface that is locally
spherical, i.e., has equal curvature along orthogonal directions.

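Minimising $\epsilon = \sum_{i,j} (\lambda \epsilon^{I}_{i,j} + \epsilon^{S}_{i,j})$ by differentiating with
respect to $l_{i,j}$, and with respect to $m_{i,j}$, and then setting each
result equal to zero, we obtain the expressions

$l_{i,j} = 0.4(l_{i-1,j} + l_{i+1,j} + l_{i,j-1} + l_{i,j+1}) - 0.1(l_{i-1,j-1} + l_{i+1,j+1} + l_{i-1,j+1} + l_{i+1,j-1}) - 0.05(l_{i-2,j} + l_{i+2,j} + l_{i,j-2} + l_{i,j+2}) + 0.8\lambda (I_{i,j} - R_{i,j}) \left.\frac{\partial R}{\partial l}\right|_{i,j}$,

$m_{i,j} = 0.4(m_{i-1,j} + m_{i+1,j} + m_{i,j-1} + m_{i,j+1}) - 0.1(m_{i-1,j-1} + m_{i+1,j+1} + m_{i-1,j+1} + m_{i+1,j-1}) - 0.05(m_{i-2,j} + m_{i+2,j} + m_{i,j-2} + m_{i,j+2}) + 0.8\lambda (I_{i,j} - R_{i,j}) \left.\frac{\partial R}{\partial m}\right|_{i,j}$.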
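
One sweep of this 'usual' update can be sketched in Python. The sketch makes several assumptions that are not in the source: grids are lists of lists, the update is applied Jacobi-style (all new values are computed from the old grid), pixels within two cells of the border are left unchanged, and the radiance function and its partial derivatives are supplied by the caller as `R`, `dR_dl`, and `dR_dm` (hypothetical names). The weighting factor is passed as `lam`.

```python
def usual_sweep(l, m, image, lam, R, dR_dl, dR_dm):
    """Return updated (l, m) grids after one sweep of the 'usual' scheme."""
    rows, cols = len(l), len(l[0])
    new_l = [row[:] for row in l]
    new_m = [row[:] for row in m]

    def stencil(g, i, j):
        # 4-neighbors, diagonal neighbors, and neighbors two pixels away.
        near = g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]
        diag = (g[i - 1][j - 1] + g[i + 1][j + 1]
                + g[i - 1][j + 1] + g[i + 1][j - 1])
        far = g[i - 2][j] + g[i + 2][j] + g[i][j - 2] + g[i][j + 2]
        return 0.4 * near - 0.1 * diag - 0.05 * far

    for i in range(2, rows - 2):
        for j in range(2, cols - 2):
            # Irradiance error at the current orientation estimate.
            err = image[i][j] - R(l[i][j], m[i][j])
            new_l[i][j] = stencil(l, i, j) + 0.8 * lam * err * dR_dl(l[i][j], m[i][j])
            new_m[i][j] = stencil(m, i, j) + 0.8 * lam * err * dR_dm(l[i][j], m[i][j])
    return new_l, new_m
```

The stencil weights sum to one (1.6 - 0.4 - 0.2), so a field that varies linearly across the image, as l and m do for a sphere, is a fixed point of the smoothness part of the update.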


The other formulation we use, the 'engineering' formulation,
creates error terms from the image irradiance equation
and the smoothness constraints, but does not combine these
into one term:

$\epsilon^{I}_{i,j} = (I_{i,j} - R_{i,j})$,

$\epsilon^{l}_{i,j} = \left(l_{i,j} - \frac{l_{i-1,j} + l_{i+1,j} + l_{i,j-1} + l_{i,j+1}}{4}\right)$,

$\epsilon^{m}_{i,j} = \left(m_{i,j} - \frac{m_{i-1,j} + m_{i+1,j} + m_{i,j-1} + m_{i,j+1}}{4}\right)$.

We view the $\epsilon$'s as residuals and apply the relaxation approach
of reducing the largest residuals. If $\epsilon^{l}_{i,j}$ or $\epsilon^{m}_{i,j}$ is selected
for reduction we choose to reduce both, as each is independent
of the other. When $\epsilon^{I}_{i,j}$ is chosen for reduction, we do the
reduction in two stages, one stage altering $l_{i,j}$ and the other
$m_{i,j}$. Of course, we can scale the residuals, reduce them from,
say, the image irradiance equation to a certain level before introducing
smoothness, vary the amount of correction we apply
(e.g., we can overrelax), and the like. In fact we can experiment
with various relaxation approaches. In this formulation major
changes in the relaxation scheme generally require minor programming
changes.
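
A minimal Python sketch of the relaxation idea follows. It covers only the two smoothness residuals; reducing the irradiance residual would additionally need the radiance function and is omitted. The function names, the in-place update, the restriction to interior pixels, and the relaxation factor `omega` (values above 1 over-relax) are assumptions of this example rather than details from the source.

```python
def smooth_residual(g, i, j):
    """Residual of grid g at (i, j) against the mean of its 4-neighbors."""
    return g[i][j] - (g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]) / 4.0

def relax_step(l, m, omega=1.0):
    """Reduce the largest smoothness residual in place; return its magnitude."""
    rows, cols = len(l), len(l[0])
    best, where = 0.0, None
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            r = max(abs(smooth_residual(l, i, j)), abs(smooth_residual(m, i, j)))
            if r > best:
                best, where = r, (i, j)
    if where is None:
        return 0.0  # every interior residual is already zero
    i, j = where
    # The l and m residuals are independent, so both are reduced together.
    l[i][j] -= omega * smooth_residual(l, i, j)
    m[i][j] -= omega * smooth_residual(m, i, j)
    return best
```

Calling `relax_step` repeatedly until the returned magnitude falls below a tolerance gives a basic relaxation loop; changing the selection rule or `omega` alters the scheme without touching the residual definitions.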


5  EXPERIMENTAL  RESULTS 

The test image, shown in Figure 1, was that of a hemisphere
placed on a plane, i.e., a synthetic image generated by
the reflectance function previously described. The collimated
light source is at slant $\pi/4$ and tilt $\pi/4$, that is, the light source is
at the upper right as we view the image. We purposely avoided
the case in which the collimated source is at the same position
as the viewer, since the resulting symmetric reflectance map
might bias the algorithm to return a symmetric surface. A
synthetic image of a sphere was selected as the test image because
both the image irradiance equation and the smoothness
relationship we use hold exactly.² The performance of the algorithm
in recovering the surface shape could be assessed without the
complications involved in using inexact models for reflectance
and smoothness.

We need initial solutions to start our iterative/relaxation
procedures.  We  used  four  sets  of  initial  conditions:  (1)  a  plane 
perpendicular to the viewing direction; (2) a plane slanted $\pi/4$ to
the  viewing  direction;  (3)  a  cone  with  its  axis  in  the  viewing 
direction;  (4)  the  correct  solution  perturbed  by  small  random 
errors. 

Previous work has used boundary conditions to constrain
the recovered surface. Investigating this approach, we constrained
the surface in various ways: at the edge of the hemisphere,
at a closed curve lying on the sphere's surface, or at
individual points on the sphere's surface. We also used the algorithms
without any boundary conditions whatsoever.

Since we wished to investigate the ability of smoothness to
propagate boundary conditions, we used various image quantizations,
namely 16 × 16, 32 × 32, and 64 × 64.

We use these as our iterative scheme to improve on an initial
solution.

² The smoothness relationship does not hold at the edge of the hemisphere
where it joins the plane.

The  findings  can  be  characterized  as  follows: 

•  Both  techniques  -  the  engineering  and  the  usual  method  - 
gave  essentially  the  same  results. 

•  The  engineering  technique  converged  much  faster  than  the 
usual  technique. 

•  Smoothness propagates boundary conditions by no more
than a few pixels.

•  The initial solution largely predetermines the final solution.

Figures 2-11 display examples of the results we achieved
by means of the 'usual' iterative scheme. In each of these
figures the top left picture shows the profile of the recovered
surface (viewed from the bottom left corner), while at the top
right we find an image that is the sine of the surface slant,
with black representing 0 and white 1. The bottom left is
the cosine and the bottom right the sine of the surface tilt,
with black representing -1, gray 0, and white +1. The results
are presented in this manner so that the performance of the
algorithms can be assessed. The profile can on occasion appear
more accurate than the individual surface orientations (as might
be expected of an integration procedure); on other occasions,
however, errors in the surface orientation (sometimes just from
the image quantization) of highly slanted surfaces cause the
integration routine that produces the surface shape for profiling
to overstate the error. Figure 2 shows the results that should be
obtained if the shape recovery algorithms recovered the surface
exactly.

Figures 3-6 illustrate the effects of various boundary conditions.
The errors at the edge of the sphere, where it joins the
plane, are expected, as smoothness does not hold there. Each
figure is the result of 320 iterations, this being five times the
linear dimension of the picture used. The boundary condition
at a point affects an area of approximately 10 pixels in radius.
Only for Figure 6, where a random five percent of pixels were set
to their correct values, is the surface shape recovered correctly.
Smoothness as a propagator affects but a small area. Figures
4, 7, and 8 further illustrate this point. Here various image sizes
are used. Observe that, as the image size increases, the boundary
conditions have less effect and the solution becomes progressively
worse. Figures 4, 9, and 10 show the dominant influence
of the initial solution. Figure 11 is included to show the effect
of smoothness when $\lambda = 0$, namely, when image irradiance
does not affect the solution at all. This figure, obtained after
320 iterations, demonstrates what smoothness alone can achieve,
even when the definition of smoothness is exact for the viewed
scene (a sphere).

Smoothness is a poor selector of surface shape and a
poor propagator of boundary information when it is used to
tie the surface orientation of a particular surface point to
those of its neighbors. Generally, in engineering, problems
solved with relaxation techniques are formulations that relate a
given property at one point to that of its neighbors by means
of differential relations. It is the derivative that propagates
boundary information and selects a particular solution to be
recovered. In what follows, we present such a formulation in an
attempt to relieve smoothness of its role as propagator and selector.


6  PROPERTIES  OF  SURFACE  RADIANCE 

Our formulation of the relationship between image irradiance
and scene radiance is

$I(x,y) = R(l,m)$,

where $I(x,y)$ is the image irradiance at image point x,y and
R(l,m) is the scene radiance for a surface normal we represent
by l,m. Note that we have not selected a particular image
projection in formulating this equation. l and m are the surface
normal components relative to the viewing direction for that surface
patch. If we are using a projective transformation, l and m calculated
for image point x,y will then have to be adjusted for the
projective distortion; if we are using orthographic projection,
no adjustment will be required.

R is a function of the components of the surface normal,
and they in turn are functions of image coordinates. R(l,m)
specifies the relationship between surface radiance and surface
orientation, while l(x,y) and m(x,y) specify the relationship
between surface orientation and image coordinates. R(l,m)
embodies knowledge of the nature of surface reflection, while
l(x,y) and m(x,y) embody the surface shape.

To provide the additional constraints we need for relating
surface orientation to image irradiance, we introduce axioms
that relate properties of R(l,m); that is, the constraints specify
the relationship between surface radiance and surface orientation.
Our axioms are

$(1 - l^2)R_{ll} = (1 - m^2)R_{mm}$,

$(R_{ll} - R_{mm})\,lm = (l^2 - m^2)R_{lm}$,

where $R_{ll}$ is the second partial derivative of R with respect to
l, $R_{mm}$ is the second partial derivative of R with respect to m,
and $R_{lm}$ is the second partial cross derivative of R with respect
to l and m. Pentland has pointed out³ that if one interprets the
image of a unit sphere as the reflectance map for a Lambertian
surface illuminated in the same manner as the sphere, then
(i) his result [8] is the second axiom, and (ii) the first axiom
may be derived from his equations. We choose to take these
relationships as axioms rather than use other assumptions.
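
The two axioms, (1 - l^2) R_ll = (1 - m^2) R_mm and (R_ll - R_mm) l m = (l^2 - m^2) R_lm, can be checked numerically for a candidate radiance function. The Python sketch below uses central finite differences; the step size, the sample constants in `lambertian`, and the slant-only counterexample `cubed_cosine` are illustrative assumptions, not values from the source.

```python
import math

def lambertian(l, m, a=0.9, b=0.6, c=0.3, d=0.2):
    """Lambertian surface under a single collimated source (sample constants)."""
    return max(0.0, a * (b * math.sqrt(1.0 - l * l - m * m) + c * l + d * m))

def cubed_cosine(l, m):
    """A function of slant alone: cos(slant) cubed."""
    return (1.0 - l * l - m * m) ** 1.5

def second_partials(R, l, m, h=1e-3):
    """Central-difference estimates of R_ll, R_mm, and R_lm."""
    R_ll = (R(l + h, m) - 2.0 * R(l, m) + R(l - h, m)) / (h * h)
    R_mm = (R(l, m + h) - 2.0 * R(l, m) + R(l, m - h)) / (h * h)
    R_lm = (R(l + h, m + h) - R(l + h, m - h)
            - R(l - h, m + h) + R(l - h, m - h)) / (4.0 * h * h)
    return R_ll, R_mm, R_lm

def axiom_residuals(R, l, m):
    """Left side minus right side of each axiom; both are zero when it holds."""
    R_ll, R_mm, R_lm = second_partials(R, l, m)
    first = (1.0 - l * l) * R_ll - (1.0 - m * m) * R_mm
    second = (R_ll - R_mm) * l * m - (l * l - m * m) * R_lm
    return first, second
```

For the Lambertian form both residuals vanish to within discretisation error, while the slant-only function satisfies the second axiom but not the first.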

These  axioms  do  not  have  to  be  true  embodiments  of  the 
physical laws of nature; rather, they represent the beliefs a
visual  system  has  regarding  the  physical  laws  of  nature.  In 
circumstances  in  which  such  beliefs  do  not  hold  the  visual  sys¬ 
tem  will  make  errors  in  predicting  the  true  nature  of  the  world. 
Of  course,  if  these  axioms  are  not  good  approximations  for  the 
physical  laws  of  nature,  the  visual  system  embodying  them  is 
useless. 

Both axioms hold for the following particular forms for
scene radiance.⁴


³ Personal communication.

⁴ Except at a self-shadow edge, where R(l,m) is not differentiable.
(i) Lambertian reflectance: illumination by a uniform
hemispherical source (for example 'sky' illumination)

$R(l,m) = a\left(\frac{1 + \sqrt{1 - l^2 - m^2}}{2}\right)$,

where a is a constant, a function of the illumination
strength and surface albedo.

(ii) Lambertian reflectance: illumination by a single
collimated source (for example infinite point source)

$R(l,m) = \mathrm{Max}[0,\ a(b\sqrt{1 - l^2 - m^2} + cl + dm)]$,

where a is a constant, a function of the illumination
strength and surface albedo, and b, c, and d are
constants, functions of the position of the illumination
source.

(iii) Lambertian reflectance: illumination by multiple
collimated sources plus a uniform hemispherical source

$R(l,m) = a\left(\frac{1 + \sqrt{1 - l^2 - m^2}}{2}\right) + \sum_i \mathrm{Max}[0,\ a_i(b_i\sqrt{1 - l^2 - m^2} + c_i l + d_i m)]$,

where i denotes the source.

If an extended light source is considered as comprising
a large set of collimated sources at different positions
with different directions and illumination strengths,
then extended sources are included in this case.

(iv) Scene radiance functions that are linear functions of
the components of the unit surface normal

$R(l,m) = e\sqrt{1 - l^2 - m^2} + fl + gm$,

where e, f, and g are constants. Note that $\sqrt{1 - l^2 - m^2}$ is
the component of the unit surface normal in the direction
perpendicular to the image plane.

For  some  forms  of  the  scene  radiance  expression,  only  one 
axiom  holds.  This  is  the  situation  for  images  produced  by  a 
scanning  electron  microscope.  The  expression  for  scene  radiance 

PI  i» 


where  n  is  a  constant,  usually  having  a  value  between  I  and  10 
that  determines  the  'sharpness'  of  the  specular  peak. 

For  the  maria  of  the  moon  the  form  of  scene  radiance 
usually  used  (2)  is 

y/\  —  P  —  mi 

a,b,c,  and  d  in  the  above  expressions  are  constants  associated 
with  the  strength  and  position  of  the  light  source,  and  with  the 
surface  albedo. 

The  axioms  do  not  hold  in  either  of  the  preceding  cases.  We 
would  expect  a  visual  system  embodying  them  to  make  errors 
under  these  circumstances.  Nevertheless  this  should  not  induce 
us  to  immediately  begin  searching  for  new  axioms.  After  all  the 
human  visual  system  is  not  perfect  under  conditions  of  specular 
reflection;  moreover,  people  observed  the  moon  throughout  hiv 
tory  without  concluding  that  it  was  spherical. 

If  these  axioms  are  embodied  in  the  human  visual  system, 
the  predictions  based  on  them  -  i.e.  when  the  visual  system  will 
return  'correct'  and  'incorrect'  information  -  could  be  tested  by 
psychological  experiments. 

We  investigate  the  relationship  between  image  irradiance 
and  surface  shape  when  both  these  axioms  bold. 


1  SURFACE  ORIENTATION  - 

IMAGE  IRRADIANCE  EQUATIONS 


Differentiating 


/(x,r)~«(l,m) 


with  respect  to  x  and  y,  we  obtain 


If  ==!  Rllm  t  RnmM  , 


If  “  Hi  If  +  Rmmp  i 

I#»  **  Hiilf^  +  Rmmmg^  +  ffijmlffSf  +  R'Jmm  +  Rmmap  , 


The  first  axiom  does  not  bold  but  the  second  (Ru  -  Rmm)tm  — 
(Is  —  m2)Rtm  does.  Note  that  ij—^p  ■»  so  that  the  second 
axiom  is  about  surface  tilt.  The  fint  axiom  introduces  slant.  In 
using  the  equations  (that  we  are  yet  to  derive)  to  recover  surface 
orientation  one  might  anticipate  that  they  would  predict  tilt 
correctly  for  the  surfaces  in  electron  microscope  images,  but 
err  in  predicting  slant. 

For  othei  forms  of  the  scene  radiance  expressions  neither 
axiom  holds.  Specular  reflectance  has  been  approximated  [2]  by 

R(l,m)-  s(Ml-I*-m*) 

+  el\/l  -  P  -  m* 

+  dm\J I  —  P  -  m*|*  , 


$$I_{yy} = R_{ll} l_y^2 + R_{mm} m_y^2 + 2 R_{lm} l_y m_y + R_l l_{yy} + R_m m_{yy} ,$$

$$I_{xy} = R_{ll} l_x l_y + R_{mm} m_x m_y + R_{lm}(l_x m_y + l_y m_x) + R_l l_{xy} + R_m m_{xy} ,$$

(where subscripted variables denote partial differentiation with
respect to the subscripts).

From the axioms we derive the relationships*

$$R_{ll} = \frac{1 - m^2}{lm} R_{lm} , \qquad R_{mm} = \frac{1 - l^2}{lm} R_{lm} .$$

*Or an equivalent set of axioms.
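
Substituting these relationships for $R_{ll}$ and $R_{mm}$ in the
expressions for $I_{xx}$, $I_{yy}$ and $I_{xy}$ we obtain

$$\left[ l_x^2 \frac{1 - m^2}{lm} + m_x^2 \frac{1 - l^2}{lm} + 2 l_x m_x \right] R_{lm} = I_{xx} - R_l l_{xx} - R_m m_{xx} ,$$

$$\left[ l_y^2 \frac{1 - m^2}{lm} + m_y^2 \frac{1 - l^2}{lm} + 2 l_y m_y \right] R_{lm} = I_{yy} - R_l l_{yy} - R_m m_{yy} ,$$

$$\left[ l_x l_y \frac{1 - m^2}{lm} + m_x m_y \frac{1 - l^2}{lm} + l_x m_y + l_y m_x \right] R_{lm} = I_{xy} - R_l l_{xy} - R_m m_{xy} .$$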


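By removing $R_{lm}$, and substituting the expressions for $R_l$
and $R_m$ defined by the expressions for $I_x$ and $I_y$, we produce
two partial differential equations relating surface orientation to
image irradiance:

$$\alpha\gamma\, l_{xy} + \beta\gamma\, m_{xy} - \alpha\epsilon\, l_{xx} - \beta\epsilon\, m_{xx} = \chi\gamma\, I_{xy} - \chi\epsilon\, I_{xx} ,$$

$$\alpha\delta\, l_{xy} + \beta\delta\, m_{xy} - \alpha\epsilon\, l_{yy} - \beta\epsilon\, m_{yy} = \chi\delta\, I_{xy} - \chi\epsilon\, I_{yy} ,$$

where

$$\alpha = I_x m_y - I_y m_x ,$$

$$\beta = I_y l_x - I_x l_y ,$$

$$\gamma = l_x^2 (1 - m^2) + m_x^2 (1 - l^2) + 2 l_x m_x\, lm ,$$

$$\delta = l_y^2 (1 - m^2) + m_y^2 (1 - l^2) + 2 l_y m_y\, lm ,$$

$$\epsilon = l_x l_y (1 - m^2) + m_x m_y (1 - l^2) + (l_x m_y + l_y m_x)\, lm ,$$

$$\chi = l_x m_y - l_y m_x .$$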

These equations are derived by using the two axioms and
assumptions of differentiability of the surface and of the image
intensities. They represent the relationship between surface
orientation and image irradiance as perceived by a visual system
whose beliefs regarding the physical laws of nature are described
by the two axioms.
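
As a numerical sanity check, the sketch below tests the two second-derivative relationships used in the derivation, written here as R_ll * l * m = (1 - m^2) * R_lm and R_mm * l * m = (1 - l^2) * R_lm, against one familiar reflectance function. The Lambertian test case R(l,m) = sqrt(1 - l^2 - m^2) (source at the viewer), the function names, and the finite-difference step are illustrative assumptions and are not taken from the paper.

```python
import math


def lambertian(l, m):
    """Assumed test reflectance: Lambertian surface lit from the viewing direction."""
    return math.sqrt(1.0 - l * l - m * m)


def second_partials(R, l, m, h=1e-4):
    """Central-difference estimates of R_ll, R_mm and R_lm at (l, m)."""
    R_ll = (R(l + h, m) - 2.0 * R(l, m) + R(l - h, m)) / (h * h)
    R_mm = (R(l, m + h) - 2.0 * R(l, m) + R(l, m - h)) / (h * h)
    R_lm = (R(l + h, m + h) - R(l + h, m - h)
            - R(l - h, m + h) + R(l - h, m - h)) / (4.0 * h * h)
    return R_ll, R_mm, R_lm


def axiom_residuals(R, l, m):
    """Residuals of R_ll*l*m = (1 - m^2)*R_lm and R_mm*l*m = (1 - l^2)*R_lm."""
    R_ll, R_mm, R_lm = second_partials(R, l, m)
    return (R_ll * l * m - (1.0 - m * m) * R_lm,
            R_mm * l * m - (1.0 - l * l) * R_lm)


if __name__ == "__main__":
    # Both residuals should be close to zero for the Lambertian case.
    print(axiom_residuals(lambertian, 0.3, 0.4))
```

A reflectance function that violates the relationships, such as R = l^2, gives a clearly non-zero first residual, which is one way to probe which radiance laws fall inside the class the axioms describe.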


8  SPECIAL  CASES 

The  two  partial  differential  equations,  while  derived 
without  assumptions  about  the  surface  shape,  are  complex  and 
difficult  to  use  for  the  process  of  surface  reconstruction  from 
image  intensities.  Simplifying  assumptions,  which  are  naturally 
more  restrictive,  can  be  made  to  reduce  the  complexity  of  the 
equations  and  hence  make  them  more  tractable  for  the  task 
of surface reconstruction. Three cases of approximation are
presented: 'chain mail', restricted 'chain mail', and spherical.
The  spherical  approximation  is  free  of  surface  derivatives,  the 
restricted  chain  mail  introduces  curvature  in  a  limited  manner, 
and  the  chain  mail  adds  rate  of  change  of  curvature  in  a  limited 
manner.  Notice  that  the  restrictions  all  refer  to  surface  shape. 

The chain mail approximation is so called because we imagine
the surface as being covered with small flat plates hinged
together, so that the axes of the hinges are parallel to the x and
y axes of the image. The component of the surface normal in
the x axis direction will not vary in the y direction (that is, the
hinge is rigid) and, similarly, the component in the y axis direction
will not vary in the x direction. We have $l_y = 0$, $m_x = 0$;
consequently $l_{xy} = 0$, $m_{xy} = 0$, $l_{yy} = 0$, and $m_{xx} = 0$. The
two partial differential equations reduce to

$$I_x m_y\, l_{xx} = l_x m_y\, I_{xx} - l_x^2 \frac{1 - m^2}{lm} I_{xy} ,$$

$$I_y l_x\, m_{yy} = l_x m_y\, I_{yy} - m_y^2 \frac{1 - l^2}{lm} I_{xy} .$$


The restricted chain mail approximation further requires
that the 'curvature' of the chain mail be constant. Applying
the additional restrictions $l_{xx} = 0$ and $m_{yy} = 0$, we obtain
the equations

$$\frac{1 - l^2}{lm} = \frac{l_x}{m_y} \frac{I_{yy}}{I_{xy}} ,$$

$$\frac{1 - m^2}{lm} = \frac{m_y}{l_x} \frac{I_{xx}}{I_{xy}} .$$

The spherical approximation assumes that we are on a
spherical surface. Besides the restrictions in the restricted chain
mail approximation, a spherical surface implies $l_x = m_y$, that
is, constant curvature independent of direction. For this case
the partial differential equations become relationships between
image irradiance and the direction of the surface normal:

$$\frac{1 - l^2}{lm} = \frac{I_{yy}}{I_{xy}} ,$$

$$\frac{1 - m^2}{lm} = \frac{I_{xx}}{I_{xy}} .$$


These results for the spherical approximation are equivalent
to those Pentland was able to obtain [8] through local
analysis of the surface. Using this technique, the only assumption
he needs is that the surface's principal curvatures are locally
equal.
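
Under the spherical approximation the two relationships (1 - m^2)/(lm) = I_xx/I_xy and (1 - l^2)/(lm) = I_yy/I_xy can be inverted in closed form for the normal components l and m. The sketch below is one possible inversion, worked out here for illustration rather than taken from the paper; the function name and the way the sign ambiguity is resolved (l is returned non-negative, and m takes the sign of the product lm) are assumptions.

```python
import math


def normal_from_second_derivatives(I_xx, I_xy, I_yy):
    """Recover (l, m), up to a joint sign flip, from image second derivatives.

    Assumes the spherical approximation and I_xy != 0.
    """
    a = I_xx / I_xy          # equals (1 - m^2) / (l m)
    b = I_yy / I_xy          # equals (1 - l^2) / (l m)
    s = a + b
    root = math.sqrt((a - b) ** 2 + 4.0)
    # p = l*m solves (a*b - 1) p^2 - (a + b) p + 1 = 0; the root of smaller
    # magnitude is the one that keeps l^2 and m^2 non-negative.
    p = 2.0 / (s + math.copysign(root, s))
    l = math.sqrt(max(0.0, 1.0 - b * p))
    m = math.copysign(math.sqrt(max(0.0, 1.0 - a * p)), p)
    return l, m


if __name__ == "__main__":
    # Synthetic derivatives for a sphere with l = 0.3, m = 0.4 (common factor dropped).
    l0, m0 = 0.3, 0.4
    print(normal_from_second_derivatives(-(1 - m0 * m0), -l0 * m0, -(1 - l0 * l0)))
```

Because only ratios of second derivatives enter, any common scale factor in the image intensities cancels; the price is sensitivity to noise wherever I_xy is small.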

These special cases are presented because the resulting
equations are less complex than the general ones. In the recovery
of surface shape from image irradiance data, initial solutions
(obtained under more restrictive assumptions) are often necessary.


9 RECOVERY OF SURFACE SHAPE

It is difficult to use the surface orientation - image irradiance
equations to recover surface shape from image intensities.
Since an analytic form for the image intensities is not
available, numerical procedures must be employed. Two types
of approaches are possible. The two differential equations can
be integrated in a step-by-step manner or, given some initial
solution (possibly the spherical approximation), a relaxation
procedure may be employed. The difficulties posed by these
approaches are usually instability on the one hand and lack
of convergence on the other. Currently we are applying this
formulation to shape recovery.


10 CONCLUSION

Axiomatization of the beliefs a visual system has about the
laws of surface radiance can provide sufficient constraints to relate
the irradiance of an image to the corresponding scene's surface
orientations. While these axioms specify a class of functions
that include many that describe known situations, it remains to
characterize the class specified by the axioms. The axioms are
presented without justification except that they include common
situations. The axioms need to be justified on the basis of
physical principles such as rotational invariance, and the physics
of surface reflection. While the derived equations specify the
relationship between surface orientation and image irradiance,
their use in shape recovery still needs to be demonstrated.


Figure S. 'Ideal' result. Top left - profile of recovered surface; top
right - sine slant, black = 0, white = 1; bottom left - cosine tilt, black = -1,
grey = 0, white = +1; bottom right - sine tilt.

Figure t. No boundary conditions; planar initial solution perpendicular
to viewing direction; image quantisation 64 X 64.

Figure t. Boundary condition curve on sphere's surface (square
shape); planar initial solution perpendicular to viewing direction; image
quantisation 64 X 64.

Figure 4. Boundary at edge of sphere given; planar initial solution
perpendicular to viewing direction; image quantisation 64 X 64.

Figure §. Random five percent of points fixed; planar initial solution
perpendicular to viewing direction; image quantisation 64 X 64.

Figure Q. Boundary at edge of sphere given; cone initial solution;
image quantization 64 X 64.

Figure 7. Boundary at edge of sphere given; planar initial solution
perpendicular to viewing direction; image quantization 32 X 32.

Figure 10. Boundary at edge of sphere given; planar initial solution
slanted J to viewing direction; image quantization 64 X 64.

Figure •. Boundary at edge of sphere given; planar initial solution
perpendicular to viewing direction; image quantization 16 X 16.



CONCEPT  MAPS 


David M. McKeown, Jr.
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, PA 15213¹


Abstract

This paper describes a representational mechanism for constructing
3-dimensional large scale spatial organizations suitable for applications
in areas such as cartography and land use studies, photo interpretation
for reconnaissance and surveillance, and geological modeling for
resource analysis. It focuses on the representation and utilization of
map information as a knowledge source for photo-interpretation, in
particular, the description of a highly detailed, large scale geographic
area: Washington, D.C. Methods of data acquisition, query
specification and geometric operations on map data are discussed.
These ideas have been implemented in a working map database
system, CONCEPTMAP, as a component of MAPS (Map Assisted
Photo-interpretation System), our ongoing research in interactive
photo-interpretation work stations.

1.  Introduction 

Consider  the  problem  of  building  a  system  capable  of  generating 
answers to representative map database queries such as:²

• "How many bridges cross the Potomac River between
Virginia and the District of Columbia?"

• "Display images of National Airport before 1976."

• "What is the closest building to this geographic point?"

• "Where is this geographic point?"


¹The research was sponsored by the Defense Advanced Research Projects Agency
(DOD), ARPA Order No. 3597, and monitored by the Air Force Avionics Laboratory
under Contract F33615-78-C-1551. The views and conclusions in this document are those
of the author and should not be interpreted as representing the official policies, either
expressed or implied, of the Defense Advanced Research Projects Agency or the U.S.
Government.


Possible  solutions  to  the  query  problem  range  from  pre-computation 
and  storage  of  potentially  huge  numbers  of  spatial  relationships,  to 
dynamic computation involving both costly search and complex
geometric analysis. We favor dynamic (on demand) computation of
geometric relationships, constrained by user defined structuring of map
features and utilizing natural spatial decomposition. However,
whatever the query resolution mechanism, a representation for the rich
variety of man-made and natural features must underlie any such
system. Further, in order to be relevant to the needs of the
photo-interpreter, the results of the query should be portrayed in terms of the
display of digital imagery, at the appropriate image resolution. While
many queries can be answered as purely "factual" responses (i.e., 8
bridges cross the Potomac between Virginia and the District of Columbia), in
our system we are able to quickly show the user the location, relative
position, and scene context directly through our aerial imagery, as well
as providing the necessary textual and descriptive information. This is a
key point of departure from many "geographical" or "spatial
information systems" [3, 4, 5] in that they simply provide tabular lookup
for geographic "facts", and vector-based display of digitized map data.

Our map database consists of a collection of concepts each describing
large spatial features such as political areas (states, counties, towns...),
business and residential areas, parks and natural features (rivers,
streams, lakes...). The same concept representation is used to
hierarchically describe man-made features (airports, power stations,
universities, industrial sites...). A window-oriented raster image display
facility, BROWSE [9], is the man-machine interface for the concept
map database and is used to create, edit, and display concept features
superimposed on 2D aerial photography, and to generate arbitrary 3D
scene views.

²We are not concerned with the natural language issues of such queries; our query
interface is based on a combination of query template matching, geometric specification,
and interactive coordinate input through raster image display.

In this paper we will explore many of the issues raised in map feature
representation and query resolution, and describe the current
implementation of the CONCEPTMAP database of the Washington
D.C. area. In the following section we give an overview of the MAPS
system components.

2.  MAPS  Components 

MAPS represents ongoing research in the areas of interactive aids for
photo-interpretation, image/map spatial databases, and image
understanding. Various components of the system have been described
in [7] (integration of terrain, imagery, and map databases), [8] (system
goals and design), and [9] (BROWSE window-oriented display system).
For completeness we will highlight the major capabilities of the MAPS
components, but the reader is referred to the above for more detail.

2.1. Image Database

We  have  been  working  with  an  online  database  of  approximately  40 
aerial  mapping  photographs  providing  spatial  and  temporal  overlap, 
centered over the Washington D.C. area. Each photograph has been
digitized with a 100 micron aperture to approximately
2200x2200x8 bits/pixel. Original photography scales range from
1:12000 to 1:36000, and we have recently acquired and begun to
integrate 20 new digitized images at 1:60000 scale into the database.
Associated with each image are several files, among which are:

• scene description file. Contains scene and image
formation information such as camera type, aircraft platform
data, geodetic corner points, digitization data, and source of
data.

• correspondence file. Contains image/map
correspondence points for known ground control points.
This file is interactively generated and modified by the
image-to-map correspondence component.

• coefficients file. Contains image-to-map camera
calibration coefficients, error function description, and
reference to the associated correspondence file.

2.2.  DLMS  Database 

We have adapted and restructured a geodetic based polygon feature
database (DLMS Level I) [1] and a digital terrain elevation database
(USGS DTED) provided by the Defense Mapping Agency, to allow for
efficient feature access based on geodetic coordinates or feature
attributes. This database (DLMS3D) provides a fairly coarse
description of major natural and cultural terrain features and is used as
a background basis for 3D display of urban scene simulations.


2.3.  BROWSE  Window  Oriented  Display 

BROWSE  is  a  window-oriented  display  manager  which  supports 
raster  image  display,  overlay  of  graphical  data  such  as  map  descriptions 
and  image  processing  segmentations,  and  the  specification  and 
generation  of  3D  shaded  surface  models.  Digitized  imagery  from  black 
and  white  and  color  aerial  mapping  photographs  is  displayed  by 
BROWSE  at  multiple  levels  of  resolution  and  allows  for  dynamic 
positioning,  zooming,  expansion  or  shrinking  of  the  image  window. 
Map  data  represented  as  vectors  and  polygons  can  be  superimposed  on 
the imagery through image-to-map registration. Access to collateral
map databases and terrain models may be accomplished using the
BROWSE  graphical  interface.  Finally,  the  window  representation  gives 
a  convenient  communication  mechanism  for  passing  image  fragments 
to  image  interpretation  programs,  which  generally  run  as  separate 
processes.  The  results  of  such  processing  can  be  returned  to  BROWSE 
for  further  processing  by  the  user. 

BROWSE is used regularly as a front-end for image processing and
database  programs  in  the  MAPS  system  and  also  as  a  general  purpose 
image  display  facility. 

2.4.  Landmark  Database 

A landmark database (LANDMARK) containing approximately 180
ground control points over the Washington D.C. area has been created.
Each landmark entry consists of the geodetic coordinate <latitude,
longitude, elevation>, a textual description of the landmark, and a
representative image fragment which defines the ground position for
the interactive user. Entries in the landmark database may be selected
by name, geodetic location, or by interactive menu selection from a raster
display of image fragments.
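
The name and location lookups described above can be sketched in a few lines. The `Landmark` class, its field names, and the two selection functions are hypothetical illustrations, not the LANDMARK implementation, and the image fragment is omitted for brevity. The distance measure is an equirectangular approximation, chosen here only because it is adequate over a city-sized area.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Landmark:
    """Hypothetical landmark record: geodetic coordinate plus text description."""
    name: str
    latitude: float    # degrees
    longitude: float   # degrees
    elevation: float   # metres
    description: str = ""


def select_by_name(landmarks: List[Landmark], name: str) -> Optional[Landmark]:
    """Return the first landmark whose name matches, ignoring case and padding."""
    wanted = name.strip().lower()
    for landmark in landmarks:
        if landmark.name.lower() == wanted:
            return landmark
    return None


def select_nearest(landmarks: List[Landmark], latitude: float,
                   longitude: float) -> Landmark:
    """Return the landmark closest to a geodetic location (equirectangular distance)."""
    def separation(landmark: Landmark) -> float:
        mean_lat = math.radians((landmark.latitude + latitude) / 2.0)
        d_lat = landmark.latitude - latitude
        d_lon = (landmark.longitude - longitude) * math.cos(mean_lat)
        return math.hypot(d_lat, d_lon)
    return min(landmarks, key=separation)
```

Interactive menu selection from image fragments would sit on top of the same records, with the display system supplying the chosen entry.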

2.5. Image-to-Map Correspondence

An  interactive  image-to-map  correspondence  component  (CORRES) 
uses  the  landmark  database,  the  image  database,  and  BROWSE 
window  display  primitives  to  allow  a  user  to  graphically  select  a 
landmark  and  indicate  the  corresponding  point  in  the  new  image.  After 
the specification of the first corresponding point, CORRES can
generate an initial guess of the map coverage using flight line, image
scale and digitization data from the scene description file stored in the
image database. Landmark candidates are graphically superimposed on
the new image, allowing novice users to select landmarks with little
domain knowledge. Since each landmark has an associated image
fragment, we could extend these interactive techniques to a more
semi-automatic system which would perform image fragment matching to
calculate a set of local correspondence points within the landmark area,
possibly resulting in a more robust match.

2.6. Hand Segmentation

An  interactive  human  segmentation  system  (SEGMENT)  was  one  of 
the earliest tools developed for the MAPS system. It allows the user to
specify the position and shape of a map feature, and provides capabilities to
edit and display segmentations in multiple levels of detail.
Segmentation files are used as intermediate representations for the map
database. Facilities to convert image based descriptions into map based
descriptions or to project map based descriptions onto new imagery are
provided.

2.7.  Machine  Segmentation 

An experimental coarse-fine segmentation system (MACHINESEG)
using region-growing and edge profile analysis has been successfully
used to extract map features such as buildings, roads, and bridges from
our aerial imagery. MACHINESEG uses a coarse hand or map
segmentation to specify the area within which a detailed machine
segmentation should be performed. The user can accept, reject or edit
the segmentation descriptions as they are generated.

2.8. Feature Mensuration

A  simple  image  based  feature  measurement  system 
(PHOTOGRAM)  is  currently  under  development  to  provide  accurate 
map  feature  ground  measurement  data  for  integration  with  the  map 
database.  This  system  uses  the  BROWSE  window  display  system  and 
the image database to calculate linear distance, rectangular area,
polygon area, and radial distance.
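
Three of the measurements named above reduce to short planar formulas once image points have been mapped to ground coordinates. The sketch below assumes the points are already expressed in a local planar ground frame (for example metres east and north); the function names are hypothetical and PHOTOGRAM's actual computations are not described in the text.

```python
import math


def linear_distance(p, q):
    """Straight-line ground distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def rectangular_area(corner_a, corner_b):
    """Area of the axis-aligned rectangle spanned by two opposite corners."""
    return abs(corner_b[0] - corner_a[0]) * abs(corner_b[1] - corner_a[1])


def polygon_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]   # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0
```

Radial distance can be taken as `linear_distance` from a chosen centre point; over terrain with significant relief the elevation difference would also have to be accounted for.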

3.  Map  Database 

The  concept  map  database  component  of  MAPS  is  central  to 
providing  access  to  imagery,  guiding  photo  interpretation,  and 
processing queries about manmade and natural features. Through the
image-to-map correspondence process, map knowledge can be applied
to any image, and the spatial relationships of sets of imagery can be
established. The concept map provides a framework within which
individual map features can be associated with high-level semantic map
descriptions. Concept maps capture the spatial arrangement in urban
areas of neighborhoods, political, and geographical boundaries. For
example, terms such as "Northwest Washington", "Georgetown",
"Foggy Bottom", "Alexandria, Virginia" are often used to describe
general areas within and around Washington D.C. They provide an
important mechanism for symbolic access into an image database, e.g.,
"display images of Georgetown later than 1976". However, depicting
precise boundaries of conceptual features from aerial imagery is a
difficult problem. In many cases boundaries are ill-defined and highly
dependent on the user's own spatial model, which often corresponds to
a hierarchy of levels of detail among map features. The
CONCEPTMAP database allows users the flexibility of describing this
hierarchy in terms of a geodetic coordinate system, independent of any
particular image, while using the imagery directly as the medium of
input.

Concept map features can be directly used to partition large scale
spatial areas based on natural spatial relationships such as containment,
subsumed by, and intersection. Using these relationships, which often
arise  in  database  queries,  rather  than  artificial  cellular  or  raster 
organizations  traditionally  used  for  spatial  decomposition  appears  to 
better  model  the  performance  of  human  map  interpreters. 
Additionally,  as  we  will  describe  in  Section  6,  many  queries  into  the 
map  database  can  be  resolved  at  the  symbolic  level  through 
manipulation  of  spatial  relationships  without  resorting  to  geometric 
computations.  In  the  following  section  we  will  describe  the 
representation  and  organization  of  concepts  in  the  map  database. 

4.  Concept  Map  Representation 

Each  entity  in  the  concept  map  is  represented  by  a  concept  schema. 
The  schema  is  given  a  unique  ID  by  the  database  and  the  user  specifies 
a 'symbolic' print name for the concept. Each concept may have one or
more role schema associated with it. The practical effect of multiple
roles is to allow for differing views of the same geographic concept, i.e.,
"northwest Washington" has roles of residential area as well as
political, while sharing the same 3D map description. A principal role is
assigned by the user, indicating a preferred view or a role whose 3D
map description defines the concept's spatial extent. Figure 1 gives the
organization of the concept schema. The CONCEPTMAP database is
composed of lists of concept schema, with access functions based on
symbolic name, geodetic coordinate, and spatial relationships.

4.1. Role Schema

The role schema depicted in Figure 2 contains the definition of a role
name and further specification by subrole name, and a description of role
class (i.e., buildings may be government, residential, commercial, etc.).
The role type attribute addresses the issue of whether the role is
physically realized in the scene (image), or is a conceptual feature such
as cultural (neighborhood), political, or geographic boundaries.
Further, role type allows the user to define a role schema as a collection
of aggregate physical or conceptual features. For example, the concept
"district of Columbia" has role type aggregrate-conceptual with
aggregate roles "northwest Washington", "northeast Washington",
"southwest Washington", and "southeast Washington". This mechanism
allows the database to explicitly represent concepts which are strictly

CONCEPT SCHEMA

Concept ID         <assigned by database, unique ID>
Concept Name       <user defined symbolic name string>
Role Count         <number of roles defined for concept>
Principle Role     <default role interpretation for concept>
Role List          <list of role id's>
    Role ID 1
    ...
    Role ID n

Figure 1: Concept Schema

ROLE SCHEMA

Role ID            <assigned by the database>
Role Name          <major role (bridge, airport, university ...)>
Subrole Name       <further specification of role>
Role Class         <generic class (industrial, government ...)>
Role Type          <physical or conceptual, single or collection>
Role Derivation    <method by which role was added to database>
Role Mark          <internal use during database query>
Role 3D ID         <assigned by database for lat/lon/elevation dest
Role Defn ID       <assigned by database for role attribute specify
Aggregrate List    <list of user-defined component roles>
    Role ID 1
    ...
    Role ID n


composed of other concept roles, and can be used in query resolution as
a form of inheritance. That is to say, attributes such as population of
"district of Columbia" can be calculated by examining the attribute
values of its aggregate roles. Similar operations based on geometric
calculation of spatial containment provide a more flexible mechanism
for such analysis.
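
The aggregate mechanism can be illustrated with a small sketch. The `Role` class, its field names, the `properties` dictionary, and the recursive summation are hypothetical simplifications of the role schema, and the role ids and attribute values used in any example are placeholders rather than database contents. The sketch also assumes that aggregate lists never contain cycles.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Role:
    """Hypothetical, reduced role schema: only the fields needed for aggregation."""
    role_id: str
    role_name: str
    role_type: str = "physical"
    properties: Dict[str, float] = field(default_factory=dict)
    aggregate_ids: List[str] = field(default_factory=list)


def aggregate_attribute(roles: Dict[str, Role], role_id: str, name: str) -> float:
    """Return an attribute of a role, summing over component roles if needed.

    A value stored directly on the role takes precedence; otherwise the
    values of the roles in its aggregate list are added up recursively.
    """
    role = roles[role_id]
    if name in role.properties:
        return role.properties[name]
    return sum(aggregate_attribute(roles, child_id, name)
               for child_id in role.aggregate_ids)
```

Summation suits additive attributes such as population; an attribute such as housing density would need a different combining rule (for example an area-weighted mean).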

Other role schema attributes are role derivation and role mark. Role
derivation accounts for the method by which the role and 3D ID
descriptor were added to the concept map database. Role mark is used
to mark nodes during query search, and during creation, deletion and
modification. Each role schema contains a unique 3D ID which defines
a set of <latitude/longitude/elevation> triples which position the role in
map space. The 3D description allows for point, line, and polygon
features as primitives, and the aggregation of primitives into more
complex topologies, i.e., regions with holes, discontinuous lines, and
point lists. Figure 3 gives a list of the current role schema attribute
values.


Figure  2:  Role  Schema 

4.2.  Further  Role  Specification 

Associated  with  each  role  name  there  is  a  detailed  role  property 
template which further specifies role context dependent attributes or the
subrole. For instance, for the role name residential area the subroles
may be single family, mixed housing, apartment complex, rural. The
role property template contains slots for population, housing density,
roof and tree cover as a percentage of area, and other attributes. In the
absence of specification by the user, default attribute values are used,
within the context of the subrole. Users may dynamically create new
subroles, and use existing or newly specified attribute defaults. The
addition of a new role name and associated role property template
requires intervention by the system maintainer. Figure 4 gives a list of
the current subrole attribute values for the roles buildings, bridges, and
airport.

Figure 5 gives a partial list of the current concept symbolic names
and associated role ids. As of this writing there are 110 concepts with
183 roles in the CONCEPTMAP database. We plan to incrementally
increase  the  complexity  of  the  database  both  in  terms  of  number  of  map 
features  represented  and  the  richness  of  the  underlying  representation. 


ROLE: 16 role names:
    unknown            university        building          water
    bridge             political         road              park
    reservoir          sports complex    airport           hospital complex
    residential area   geographic        industrial area   parking lot

ROLE-TYPES: 5 role types:
    unknown                 physical              conceptual
    aggregrate-conceptual   aggregrate-physical

ROLE-CLASS: 8 role classes:
    unknown            industrial        transportation    natural feature
    residential        government        cultural feature  commercial

ROLE-DERIVATION: 5 derivation classes:
    unknown                 hand-segmentation     landmark-description
    terminal-interaction    machine-segmentation

ROLE-MARK: 9 mark classes:
    none               geo-query         template-query
    new-concept        new-role          modify-concept
    modify-role        new-3D            modify-3D

Figure 3: Role Schema Attribute Values

ROLE: building    State information for role (1):
Subrole file: /usriO/vdata/maps/building.dyn
    unknown              performing arts complex   museum
    office building      railroad station          dormitory
    government building  administration            memorial
    concert hall         terminal building
ID file status: KEY: 'BULD'  Min: 1  Max: 38  Active: 32  Next: 39

ROLE: bridge    State information for role (2):
Subrole file: /usriO/vdata/maps/bridge.dyn
    unknown              railroad                  pedestrian
    automobile
ID file status: KEY: 'BRDG'  Min: 1  Max: 8  Active: 8  Next: 9

ROLE: airport    State information for role (5):
Subrole file: /usriO/vdata/maps/airport.dyn
    unknown              commercial                military
    operations building  terminal                  runway
    hangars              navigational beacons
ID file status: KEY: 'AIRP'  Min: 1  Max: 10  Active: 10  Next: 11

Figure 4: Subrole Attribute Values: building, bridge, airport


5.  Some  Examples 

In  this  section  we  will  briefly  describe  three  sample  concept  map 
entries taken from the Washington D.C. concept map database. These
examples illustrate the flexibility of the concept map representation and
were created by interactive query to the database.


5.1. Map Feature Concept

Figure 6 shows a typical map feature entry in the CONCEPTMAP
database. This entry, 'Washington circle' (a traffic circle in the Foggy
Bottom area), was created during an interactive terminal session. Figure
7 gives the <latitude, longitude, elevation> description for the
'Washington circle' conceptmap entry, which is defined in the role
schema as 'D3ID3' and was created by interactive specification of image
points in database image 'dc38617'. Using the image-to-map
correspondence for 'dc38617', geodetic coordinates are calculated.
Ground elevations are calculated by lookup and interpolation from our
digital terrain elevation database [1]. The original image coordinates
are saved for possible refinement, and are accessible through the
'D3ID' attribute.

Figure 8 gives the conceptmap entry for the 2D feature description for
'Washington circle'. Simple shape features such as centroid, area,
perimeter, and Fourier coefficients are calculated from the role schema
D3ID in map coordinate space and are used by our MACHINESEG
system in conjunction with the D3ID to specify location and shape of
map features.
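
Three of the shape features named above (centroid, area, perimeter) follow directly from the ordered boundary vertices of a feature. The sketch below assumes the vertices have already been projected into a planar map coordinate frame and describe a simple polygon with non-zero area; the function name and the returned dictionary layout are hypothetical, and Fourier coefficients are not covered.

```python
import math


def shape_features(vertices):
    """Centroid, area and perimeter of a simple polygon given as (x, y) vertices."""
    n = len(vertices)
    twice_area = 0.0
    cx = cy = 0.0
    perimeter = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]    # wrap around to close the boundary
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
        perimeter += math.hypot(x1 - x0, y1 - y0)
    signed_area = twice_area / 2.0         # sign depends on vertex ordering
    centroid = (cx / (6.0 * signed_area), cy / (6.0 * signed_area))
    return {"centroid": centroid, "area": abs(signed_area), "perimeter": perimeter}
```

A degenerate boundary with zero area (for example a line feature) would raise a division error here and would need separate handling.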


C0NCEPT1 

CONCEPT? 

C0NCEPT3 

CONCEPT 4 

C0NCIPT5 

CONCEPT 6 

C0NCEPT7 

CONCEPTS 

CONCEPT9 

CONCEPT  10 

CONCEPT  1 1 

C0NCEPT12 

C0NCEPT13 

C0NCEPT14 

CONCEPT15 

CONCEPT  16 

CONCEPT 17 

C0NCIPT16 

CONCEPT 19 

CONCEPT20 

CONCEPT? 1 

CONCEPT?? 


tidal basin  WATER1
district of columbia  POLI1  RESI2
northwest washington  POLI2  RESI1
macmillian reservoir  RESV1
southwest washington  POLI3
northeast washington  POLI4
virginia  POLI5
maryland  POLI6
kennedy center  BULD1
ellipse  PARK1
washington circle  ROAD1
state department  BULD2
executive office building  BULD3

BULD9

white house  BULD4
treasury building  BULD5
department of commerce  BULD6
arlington memorial bridge  BRDG1
rfk stadium  SPORT1
museum of history and technology  BULD7
key bridge  BRDG2
kutz bridge  BRDG3
george mason bridge  BRDG4


CONCEPT56
CONCEPT57
CONCEPT58
CONCEPT59
CONCEPT60
CONCEPT61
CONCEPT62
CONCEPT63
CONCEPT64
CONCEPT65
CONCEPT66
CONCEPT67
CONCEPT68
CONCEPT69
CONCEPT70
CONCEPT71
CONCEPT72
CONCEPT73
CONCEPT74
CONCEPT75
CONCEPT76
CONCEPT77


national airport  AIRP1  BULD17  AIRP3  AIRP4
u.s. capitol  PARKS  BULD18
alexandria  POLI7  RESI6
old town alexandria  RESI6
washington navy yard  INDU2
bolling air force base  AIRP5
andrews air force base  AIRP6
american pharmaceutical association  BULD19
national academy of sciences  BULD20
federal reserve board  BULD21
national science foundation  BULD22
civil service commission  BULD23
interior department  BULD24
district building  BULD25
lafayette park  PARKS
constitution hall  BULD26
national press building  BULD27
23rd street  ROAD9
constitution avenue  ROAD10
virginia avenue  ROAD11
c street  ROAD12
22nd street  ROAD13


Figure 5: Washington D.C. CONCEPTS and Role IDs (partial list)
Concept: 'washington circle'
Concept Id: 'CONCEPT11'  1 roles (principle role: 0)

[0] washington circle
    Role ID: 'ROAD1'  Role Defn ID: ''
    Role name: 'road'  subrole: 'traffic circle'
    Role class: 'transportation'  type: 'physical'
    Role deriv: 'terminal-interaction'
    Role mark: 'none'
    3D Role ID: 'D3ID3'  3D Role pointer *0


Figure 6: Washington Circle


5.2.  Landmark  Concept 

Figure 10 lists the concept map entry for 'george mason bridge'. The
role schema attribute role derivation specifies this concept as being a
'landmark-description'. When listing this role entry, the concept schema
attribute concept name is used to index into the landmark database
(LANDMARK) [8] to produce the textual description which defines
the landmark entry. This allows entries in our landmark database to be
directly accessible through the concept map.


14 Points  Generic name: 'dc38617'  Feature type: 'areal'
Maximum coordinate: lat N38 54 10 (487)  lon W77 3 3 (829)
minimum coordinate: lat N38 54 7 (62)  lon W77 2 59 (326)

point 0   lat N38 54 9 (52)    lon W77 3 3 (829)   elev: 16 meters
point 1   lat N38 54 10 (29)   lon W77 3 3 (131)   elev: 16 meters
point 2   lat N38 54 10 (464)  lon W77 3 2 (265)   elev: 17 meters
point 3   lat N38 54 10 (487)  lon W77 3 1 (397)   elev: 17 meters
point 4   lat N38 54 10 (428)  lon W77 3 0 (529)   elev: 17 meters
point 5   lat N38 54 9 (752)   lon W77 2 59 (656)  elev: 18 meters
point 6   lat N38 54 9 (88)    lon W77 2 59 (325)  elev: 18 meters
point 7   lat N38 54 8 (101)   lon W77 2 59 (369)  elev: 17 meters
point 8   lat N38 54 7 (294)   lon W77 3 0 (227)   elev: 16 meters
point 9   lat N38 54 7 (62)    lon W77 3 1 (92)    elev: 16 meters
point 10  lat N38 54 7 (83)    lon W77 3 2 (286)   elev: 16 meters
point 11  lat N38 54 7 (555)   lon W77 3 3 (270)   elev: 16 meters
point 12  lat N38 54 8 (554)   lon W77 3 3 (715)   elev: 16 meters
point 13  lat N38 54 9 (52)    lon W77 3 3 (829)   elev: 16 meters


Figure  7:  Washington  Circle  3D  Database  Entry 


clockwise
area = 12.201516 square sec  perimeter = 12.704513 sec
fractional fill = 0.791007  compactness = 13.228246
centroid: lat N38 54 8 (772)  lon W77 3 1 (527)
centroid of border: lat N38 54 8 (771)  lon W77 3 1 (531)
length of major axis (fitted ellipse) = 4.323209 seconds
length of minor axis (fitted ellipse) = 3.587450 seconds
major angle (fitted ellipse) = 0.001446 radians (0.08 deg)
minor angle (fitted ellipse) = 1.572243 radians (90.08 deg)
fourier coefficients (order 1 to 9):

1)  ax:  0.2383   bx:  1.7770   ay:  2.1423   by: -0.2884
2)  ax: -0.0074   bx: -0.0017   ay: -0.0028   by: -0.0035
3)  ax:  0.0343   bx:  0.0517   ay:  0.0509   by: -0.0117
4)  ax:  0.0016   bx: -0.0045   ay:  0.0012   by:  0.0037
5)  ax: -0.0029   bx: -0.0124   ay:  0.0136   by:  0.0034
6)  ax:  0 . ooos bx: -0.0088   ay:  0.0104   by:  0.0014
7)  ax:  0.0091   bx:  0.0031   ay:  0.0108   by:  0.0031
8)  ax: -0.0097   bx: -0.0030   ay:  0.0126   by: -0.0099
9)  ax: -0.0036   bx: -0.0032   ay:  0.0088   by:  0.0011

Figure  8:  Washington  Circle  2D  Shape  Descriptors 

The photograph in Figure 9 was created by CONCEPTMAP as a
result of the query "Display all images containing 'washington circle'".
Using the BROWSE subroutine package as primitives, a display frame
is created, composed of windowed image fragments centered around the
map feature. Once displayed, any of the windows can be manipulated
using commands within CONCEPTMAP. Thus, the user can select one
or more of the image fragments, expand the size of the window to
obtain more image context, move a window for side-by-side
comparison, zoom in for more detail, or adjust the center of the window.


Concept: 'george mason bridge'
Concept id: 'CONCEPT22'  1 roles (principle role: 0)

[0] george mason bridge
    Role ID: 'BRDG4'  Role Defn ID: ''
    Role name: 'bridge'  subrole: 'automobile'
    Role class: 'transportation'  type: 'physical'
    Role deriv: 'landmark-description'  mark: 'unknown'
    3D Role ID: 'D3ID14'  3D Role pointer 80

    latitude 38 52 43 300
    longitude 77 2 22 500
    elevation 12 meters

    1140.1482 in /visf/washdc/asc/dc1419/lbw.img
    landmark image at resolution 1

    george mason bridge

    Definition: A bridge spanning the Potomac River in southwest DC.
    Located adjacent to the Jefferson Memorial and the Rochambeau Bridge.

    Description: The George Mason Bridge, also known as one of the
    twin 14th St. Bridges, carries the westbound lanes of U.S. 1
    across the Potomac from 14th St. on the east bank to the Jefferson
    Davis Highway on the west. The landmark image is oriented with
    north at the top.


Figure  10:  Role  Schema:  George  Mason  Bridge 


5.3. Multiple Role Concept

The concept 'national airport' is an example of a more complex
organization of role schema. Figure 11 shows the current concept
description for 'national airport'. The principal role 'AIRP1' defines this
concept to be a commercial airport, whose boundary should be
interpreted as an aggregrate-physical feature, that is, a collection of
physically realizable boundary descriptions. Within the context of the
area represented in 'D3ID59' will be found all roles which comprise this
concept.
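A concept that takes on several roles can be pictured as a small record structure. The sketch below is illustrative only: the class and field names (`Concept`, `Role`, `d3id`) are assumptions and not the CONCEPTMAP implementation, while the attribute values are copied from the 'national airport' entry in Figure 11.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Role:
    role_id: str
    name: str
    subrole: str
    role_class: str
    role_type: str
    derivation: str
    d3id: Optional[str] = None  # ID of the 3D boundary description, if any


@dataclass
class Concept:
    concept_id: str
    name: str
    roles: List[Role] = field(default_factory=list)
    principal: int = 0  # index of the principal role

    def principal_role(self) -> Role:
        return self.roles[self.principal]

    def roles_with_boundary(self) -> List[Role]:
        """Roles that carry a 3D description and so can take part in geometric tests."""
        return [r for r in self.roles if r.d3id is not None]


national_airport = Concept("CONCEPT55", "national airport", [
    Role("AIRP1", "airport", "commercial", "transportation",
         "aggregrate-physical", "hand-segmentation", "D3ID59"),
    Role("BULD17", "airport", "terminal", "transportation",
         "physical", "terminal-interaction", "D3ID60"),
    Role("AIRP3", "airport", "runway", "transportation",
         "physical", "hand-segmentation"),        # no boundary description
    Role("AIRP4", "airport", "hangars", "transportation",
         "physical", "terminal-interaction", "D3ID61"),
])

if __name__ == "__main__":
    print(national_airport.principal_role().role_id)
    print([r.role_id for r in national_airport.roles_with_boundary()])
```

Listing `roles` returns every role of the concept by name, whereas `roles_with_boundary` returns only the roles that a geometric test could ever reach.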

The other roles define the airport terminal building, a runway, and a
collection of hangars. The terminal building 'BULD17' and the
airplane hangar 'AIRP4' have boundary descriptions associated with
them, while the runway 'AIRP3' role has none. Geometric queries on
the concept map database would find the terminal building and hangar
as contained within the principal role of the concept 'CONCEPT55',
"national airport". However, a symbolic query asking for all the roles
associated with the principal role for concept CONCEPT55 would
find the runway, as well as the other two roles. The relative merits and
limitations of static, user defined symbolic representations ... using ...
structures.

Figure 9: Database Retrieval of Multiple Views


We have begun to create detailed models for airports and industrial
area roles, initially as a guide to the interactive user, but expect to
integrate such static descriptions into an active query component in the
future. We are calling such descriptions site description models.
Currently, users are free to describe as role schema those portions of the
airport description model that are of importance, without a requirement
to create a completely specified airport site description model. Figure
12 details a preliminary organization of an airport description model.

6. Map Query

There are four database access primitives which can be employed
singularly or in combination to extract the positional and
relational attributes required to answer the queries posed in the
beginning of this paper.
Concept: 'national airport'
Concept id: 'CONCEPT55'  4 roles (principle role: 0)

[0] national airport
    Role ID: 'AIRP1'  Role Defn ID: ''
    Role name: 'airport'  subrole: 'commercial'
    Role class: 'transportation'  type: 'aggregrate-physical'
    Role deriv: 'hand-segmentation'  mark: 'unknown'
    3D Role ID: 'D3ID59'  3D Role pointer 90

[1] national airport
    Role ID: 'BULD17'  Role Defn ID: ''
    Role name: 'airport'  subrole: 'terminal'
    Role class: 'transportation'  type: 'physical'
    Role deriv: 'terminal-interaction'  mark: 'unknown'
    3D Role ID: 'D3ID60'  3D Role pointer 90

[2] national airport
    Role ID: 'AIRP3'  Role Defn ID: ''
    Role name: 'airport'  subrole: 'runway'
    Role class: 'transportation'  type: 'physical'
    Role deriv: 'hand-segmentation'  mark: 'unknown'
    3D Role ID: ''  3D Role pointer 90

[3] national airport
    Role ID: 'AIRP4'  Role Defn ID: ''
    Role name: 'airport'  subrole: 'hangars'
    Role class: 'transportation'  type: 'physical'
    Role deriv: 'terminal-interaction'  mark: 'unknown'
    3D Role ID: 'D3ID61'  3D Role pointer 90


Figure 11: Concept Schema for National Airport

Figure 12: Role Definition Template for AIRPORT Role


6.1. Signal Access

Display or list all concepts within an interactively specified image
area. Signal access requires an explicit image-to-map correspondence.
Image coordinates are used to calculate map coordinates which are used
to search the concept map database. Figure 13 is a display frame
created by CONCEPTMAP as a result of an interactive user query to
display the area around a set of storage tanks near the Washington D.C.
navy yard. The query area is superimposed as a blue overlay in each of
the display windows, the area of interest is centered in each window,
and displayed at the highest resolution that fits within the window
partition. Signal access queries are purely dynamic, involving only the
BROWSE window manager and the image database and do not use the
concept map symbolic data structures.


6.2.  Symbolic  Access 

Display or list all concepts with a given symbolic name. Requires
explicit mapping of a user defined name (ie. 'memorial bridge') into the
map coordinate system. As we described in section 5.1, the role schema
3D ID gives us a direct mechanism for searching the image database.

6.3. Role Template Access

Given a completely or partially specified role schema, find all roles in
the concept map database which satisfy the specification. The user can
specify additional constraints based on the role property template if the
role name and subrole name have been specified. The result of a
template access is a list of role schema IDs. These may be printed or
displayed by the user as described above.
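Template access amounts to matching a partially filled-in record against every stored role. A minimal sketch follows, assuming roles are held as dictionaries; the key names and the function `template_access` are hypothetical, while the role values are taken from Figures 6, 10 and 11.

```python
ROLES = [
    {"role_id": "ROAD1", "name": "road", "subrole": "traffic circle",
     "class": "transportation", "type": "physical"},
    {"role_id": "BRDG4", "name": "bridge", "subrole": "automobile",
     "class": "transportation", "type": "physical"},
    {"role_id": "AIRP1", "name": "airport", "subrole": "commercial",
     "class": "transportation", "type": "aggregrate-physical"},
    {"role_id": "BULD17", "name": "airport", "subrole": "terminal",
     "class": "transportation", "type": "physical"},
    {"role_id": "AIRP3", "name": "airport", "subrole": "runway",
     "class": "transportation", "type": "physical"},
    {"role_id": "AIRP4", "name": "airport", "subrole": "hangars",
     "class": "transportation", "type": "physical"},
]


def template_access(roles, template):
    """Return the IDs of roles that agree with every attribute given in the template.

    Attributes missing from the template are unconstrained, so an empty
    template matches every role.
    """
    return [role["role_id"] for role in roles
            if all(role.get(key) == value for key, value in template.items())]


if __name__ == "__main__":
    print(template_access(ROLES, {"name": "airport", "subrole": "runway"}))
    print(template_access(ROLES, {"name": "airport", "type": "physical"}))
```

The returned list of IDs plays the part of the list of role schema IDs that the text says may then be printed or displayed.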
Figure 13: Signal Access Display


6.4. Geometric Access

Using the 3D role latitude/longitude/elevation description, compute
geometric properties such as containment, subsumed by, intersection,
adjacency, and closest point. A list of role schema IDs which satisfy the
geometric constraints is created. In the case of intersection and
adjacency, a temporary role schema with 3DID is generated with the
results (point, line, polygon) of the geometric operation for each pair of
database role schema.
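Two of the geometric properties named above, containment and closest point, can be sketched in a few lines. This is an illustration under simplifying assumptions, not the method used by CONCEPTMAP: coordinates are treated as planar (x, y) pairs, elevation is ignored, and the function names `contains` and `closest` are hypothetical.

```python
import math


def contains(polygon, point):
    """Even-odd (ray casting) test: is the point inside the polygon?

    polygon: list of (x, y) vertices in boundary order.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def closest(features, point):
    """Return the ID of the feature whose reference point is nearest to point.

    features: list of (role_id, (x, y)) pairs.
    """
    return min(features, key=lambda item: math.dist(item[1], point))[0]


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]
    print(contains(square, (2, 2)), contains(square, (5, 2)))
    print(closest([("a", (0, 0)), ("b", (3, 3))], (2, 2)))
```

Containment of one polygon in another (the subsumed by relation) could be approximated by applying `contains` to every vertex of the inner polygon, which is sufficient for convex outer polygons only.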

6.5.  Integrating  Access  Methods 

In order to generate answers for several of the map database queries
posed at the beginning of this paper, we must actually perform
sequences of symbolic, signal, template, and geometric access functions.
There are clearly different costs associated with each method,
geometric computation being the most expensive, symbol to signal
being the least expensive. We currently require that the user specify the
sequencing of access methods, with CONCEPTMAP providing
automatic storage of temporary results in the form of querylists of role
schema which satisfy a primitive query. Let us analyze those sample
queries in terms of our query primitives.

• "How many bridges cross the Potomac River between
Virginia and the District of Columbia."

Get symbolic level from symbolic level with template and
geometric constraint

• "Display images of National Airport before 1976."

Get image from symbolic level

• "What is the closest building to this geographic
point" [point to screen].

Get symbolic level from template, signal and geometric
constraint

• "Where is this geographic point" [specify geodetic
coordinate]

Get signal and symbolic level from signal constraint
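The first sample query can be read as a sequence of primitive accesses whose querylists are combined. The sketch below only illustrates that combination step: the helper `intersect` is hypothetical, and both querylists are made-up illustrative values rather than results from the database.

```python
def intersect(querylist_a, querylist_b):
    """Keep the role IDs of querylist_a that also appear in querylist_b, in order."""
    keep = set(querylist_b)
    return [role_id for role_id in querylist_a if role_id in keep]


# Step 1 (template access): all roles whose role name is 'bridge'.
# Illustrative IDs only.
bridges = ["BRDG1", "BRDG2", "BRDG3", "BRDG4"]

# Step 2 (geometric access): all roles whose 3D description intersects the river.
# Illustrative IDs only; a real query would compute this list geometrically.
crossing_river = ["BRDG1", "BRDG2", "BRDG4", "ROAD7"]

answer = intersect(bridges, crossing_river)

if __name__ == "__main__":
    print(len(answer), answer)
```

Because the geometric step is the most expensive, a user sequencing the primitives by hand would normally run the template access first and apply the geometric test only to the roles it returns.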
6.6. Hierarchical Query Resolution

Consider the query "where is the intersection of 31st and m street".
One approach is to simply find the 3D map descriptions for each of the
roles (symbol to signal), and perform a geometric operation to calculate
a ground coordinate position. The concept map provides the capability
to use symbolic geometric relationships to generate the following
response:

(greater washington d.c.)
(district of columbia)
(northwest washington)
(georgetown area)
(georgetown business district)
(lat N38 54 18 (374) lon W77 3 41 (213) el

When the actual geodetic location of an intersection is required, a
geometric operation must be performed, unless it is defined as a map
concept. However, each of the other responses resulted from a traversal
of a containment tree which is maintained by the concept map database.
The containment tree is precomputed from all map concept features
having polygonal 3DID descriptions using the geometric relationships
of subsumed by and contains. Features having a clear hierarchy of level
of detail, such as political and cultural (neighborhood) concepts, can
form the basis for partitioning of other map features to improve query
performance and to better model the spatial organization as more than
just a collection of independent concepts.
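Once the containment tree has been precomputed, producing the symbolic part of the response above is a walk from a leaf to the root. The sketch below assumes the tree is stored as a child-to-parent dictionary; that storage choice and the function name `containment_path` are assumptions, while the concept names are those of the sample response.

```python
# Child -> parent links of a precomputed containment tree (assumed storage form).
PARENT = {
    "georgetown business district": "georgetown area",
    "georgetown area": "northwest washington",
    "northwest washington": "district of columbia",
    "district of columbia": "greater washington d.c.",
}


def containment_path(name, parent):
    """Return the chain of enclosing concepts, outermost first, ending at name."""
    path = [name]
    while path[-1] in parent:          # stop at the root, which has no parent
        path.append(parent[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    for concept in containment_path("georgetown business district", PARENT):
        print("(%s)" % concept)
```

No geometric computation is needed for this traversal; only the final geodetic position of the intersection requires one.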

For this reason, we would like to explore building hierarchical
descriptions using the concept map database. We can anticipate its use
as a knowledge source for more complex matching, for instance in
symbolic scene recognition. For example, the occurrence of role
descriptions for oil tank farm, power transformers, and cooling towers
within close physical proximity indicates the area may be a power plant
or industrial.

7. 3D  Map  Display 

A central problem for a variety of cartographic tasks is flexible access
to 3D map databases (6). Tasks include inspection and verification of
spatial databases, incremental update, and feature enhancement.
CONCEPTMAP provides tools for the selection of ground area either
through image-to-map correspondence (ie. describing the area to be
portrayed via digital imagery) or direct specification of map
coordinates. The photograph in Figure 14 shows a IWI frame window
containing a two dimensional map image of an area around Washington
D.C. This 13 color-class thematic image shows areas such as forest
and park (green), water (blue), residential (yellow), and high density
urban (brown). It was generated by scan conversion of a polygon map
database provided by the Defense Mapping Agency (DLMS Level 1).


In  this  application,  the  user  indicates  a  rectangular  area  of  interest  in 
the  map  image,  specifies  the  center  point  (west  of  National  airport), 
viewing  position  (from  the  southeast),  and  view  angle.  This  is  done  by 
tracking  a  cursor  on  the  display  to  minimize  the  amount  of  knowledge 
that  the  user  must  have  of  the  actual  3D  coordinate  system. 

The photograph in Figure 15 shows the result of the 3D map
generation. For each image point in the area specified by the user, a
map coordinate is calculated (latitude, longitude, elevation). A 3D
surface description is generated using the thematic color from the map
image, and this description is passed to a 3D shaded raster graphics
display program (2). The resulting map image is then displayed by
BROWSE.

The CONCEPTMAP database provides 3D map feature descriptions
for the generation of cartographically accurate urban scenes. The
DLMS scene as generated in Figure 15 is used as a base map, onto which we
project our map database features. The photograph in Figure 16 shows
a view of the Foggy Bottom area with the observer looking towards the
southeast from above the intersection of Virginia Avenue and 23rd
Street. Buildings in the scene are (from left to right) constitution hall
(clipped to the scene viewport), interior department, civil service
commission, bureau of indian affairs, federal reserve board, state
department and national academy of science. Roads are virginia avenue
(bottom right to center left), c street (center left to middle right) and
constitution avenue (running along the light/dark terrain boundary).
The linear feature running between the interior department and civil
service commission and occluded by the line of buildings in the rear of
the scene is the boundary of the map description for foggy bottom.

8.  Conclusions 

We have discussed the current implementation of a large scale spatial
map database organized around a concept map representation which
provides for the hierarchical description of complex natural and man-made
features. User defined views are supported by allowing concepts
to take on multiple roles, while maintaining a consistent 3 dimensional
map coordinate representation. We have shown how the
CONCEPTMAP database can be used for flexible access into an image
database, display of 3D urban scenes, and for query into spatial
databases. We believe that this work has applications in a variety of
task domains where knowledge representation can be viewed in terms
of 3 dimensional spatial organizations, particularly in cartography,
photo-interpretation, and geological modeling.
Figure 14: 2D Washington D.C. Terrain Map


Figure  15:  3D  Washington  D.C.  Terrain  Map 


Figure  16:  3D  View  of  Foggy  Bottom  Area 
Abstract

The use of rotationally symmetric operators in vision
is reviewed and conditions for rotational symmetry are
derived for linear and quadratic forms in the first and
second partial directional derivatives of a function f(x,y).
Surface interpolation is considered to be the process of
computing the most conservative solution consistent with
boundary conditions. The "most conservative" solution is
modelled using the calculus of variations to find the minimum
function that satisfies a given performance index. To
guarantee the existence of a minimum function, Grimson
has recently suggested that the performance index should
be a seminorm. It is shown that all quadratic forms in
the second partial derivatives of the surface satisfy this
criterion. The seminorms that are, in addition, rotationally
symmetric form a vector space whose basis is the square
Laplacian and the quadratic variation. Whereas both seminorms
give rise to the same Euler condition in the interior,
the quadratic variation offers the tighter constraint at the
boundary and is to be preferred for surface interpolation.

This report describes research done at the Artificial
Intelligence Laboratory of the Massachusetts Institute of
Technology. Support for the laboratory's artificial intelligence
research is provided in part by the Advanced Research
Projects Agency of the Department of Defense under
Office of Naval Research contract N00014-75-C-0643.


1.  Introduction 

Two separate themes from the Computer Vision literature
come together in this paper: the use of rotationally
symmetric operators, and the idea that several modules
of visual perception require that the "most conservative"
solution that meets a given set of boundary conditions be
computed. The two themes are combined in an investigation
of which operator to use in the interpolation of smooth
surfaces from one-dimensional boundary constraints. Such
constraints arise naturally in a variety of visual problems.


In the next section we review the role of rotationally
symmetric operators in Computer Vision, and we derive
conditions which linear and quadratic forms in the first and
second directional derivatives must satisfy in order to be
rotationally symmetric. We then discuss the idea that vision
is a conservative process, citing examples from both
figure perception and scene analysis. The "most conservative"
solution is modelled using the calculus of variations
to find the minimum function that satisfies a given
performance index. A major problem associated with the
use of the calculus of variations is guaranteeing the existence
of a minimum function (see for example Courant and
Hilbert 1953, p.173). A theorem of Grimson(1981, theorem
2) proves that a sufficient condition for the existence of a
minimum is that the performance index should be a seminorm
on the space of functions. The condition is not necessary.
For example, Horn(1981) has determined the curve
that minimizes the integral square curvature subject to tangency
conditions at the end points; the performance index
in this case is not a seminorm.

Grimson(1981) notes that many intuitively plausible
performance indices based on mean and Gaussian curvature
are not seminorms, but that the square Laplacian
$(f_{xx} + f_{yy})^2$ and the quadratic variation
$f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2$ are. We show here that any
quadratic form in $f_{xx}$, $f_{xy}$, and $f_{yy}$ is a seminorm.

To further constrain the choice of performance index
in the infinite set of quadratic forms, we require in addition
that the quadratic form should be rotationally symmetric.
We prove that there are essentially two different choices:
the square Laplacian and the quadratic variation. All the
remaining possibilities are linear combinations, that is, form
a vector space with these two as a basis.

To choose between the square Laplacian and the quadratic
variation, we consider their respective Euler conditions
and natural boundary conditions (Courant and Hilbert,
1953). The Euler conditions are identical, but the natural
boundary conditions, which are derived from the statics of
a deformed thin plate, favor the quadratic variation since
they offer tighter constraint in this case.
2. Rotationally symmetric operators in vision

A major concern of Computer Vision is the isolation
of constraints that combine with the information provided
in the image to yield an interpretation. Early work on
polyhedra (Clowes 1971, Huffman 1971, Mackworth 1973,
Waltz 1972, Sugihara 1978, 1981, Kanade 1981) focussed
upon the discovery of constraints deriving from the image
forming process, constraints that relate image fragments,
like junctions and lines, to their scene counterparts, vertices
and edges. As Computer Vision turned its attention
away from plane-faced objects to the natural world, other
constraints were required. Often the constraints expressed
some facet of the intuitive notion of "smoothness" and
did so in a way that supported useful computations (Strat
1979, Brooks 1979, Ikeuchi and Horn 1981, Woodham 1978,
Horn and Schunck 1981). Recently, smoothness and image
forming have been combined using differential geometry
(Grimson 1981, Witkin 1981, Binford 1981).

One constraint that is usually implicit, but is occasionally
made explicit, expresses the idea that perceptual
processes are often approximately isotropic. It seems that
humans usually do not show strong directional preferences
when detecting edges, motion, or reflectance boundaries.
We seem to be equally adept at perceiving the layout and
orientation of a visible surface regardless of its orientation
relative to the view vector. Ullman(1976) argues for an explicit
isotropy constraint in his work on subjective contours
(see also Knuth 1979).

Processes that are isotropic are naturally computed
by rotationally symmetric operators, since the values they
return are unaffected by the coordinate system chosen for
the image. Conversely, rotationally symmetric operators
compute isotropic information. As we shall see, many
operators that have been proposed for vision are not rotationally
symmetric but directionally selective. Some authors
have, however, proposed rotationally symmetric operators,
particularly for early visual processing.

Precise definitions of rotational symmetry for functions,
operators (or functionals), and, by specialization,
matrices are given in the following section. In the rest of
this section we assume that the definitions are already understood.

Some kinds of blurring in an image forming system
can be approximated by convolution with a Gaussian. The
rotationally symmetric Gaussian can be defined by:

$G(r) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right).$

Pratt(1978) presents several techniques, such as convolution
with the generalized inverse of the blur function,
for restoring the image (see for example, his figures 14.2.1,
14.3.2).

The Laplacian $\Delta f = f_{xx} + f_{yy}$ is well known to be
rotationally symmetric¹ and its use has been proposed
several times in Computer Vision and Image Processing.


If an image is blurred in a way that can be approximately
modelled by passing the image through a system with a
Gaussian point spread function, then it can be sharpened
by subtracting a multiple of its Laplacian (Rosenfeld and
Kak 1976, p.184, Prewitt 1970, p.107). Pratt(1978, figure
17.4.5) illustrates the use of the Laplacian for enhancing
the edges in an image.
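Sharpening by subtracting a multiple of the Laplacian can be tried on a tiny image. The sketch below is an illustration only: it assumes the common 4-neighbour discrete Laplacian, a weight `k` chosen by hand, and untouched border pixels, none of which are specified in the text.

```python
def laplacian(img, r, c):
    """4-neighbour discrete Laplacian at an interior pixel (assumed discretization)."""
    return (img[r - 1][c] + img[r + 1][c] +
            img[r][c - 1] + img[r][c + 1] - 4 * img[r][c])


def sharpen(img, k=1.0):
    """Return img - k * Laplacian(img); border pixels are copied unchanged."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = img[r][c] - k * laplacian(img, r, c)
    return out


# A blurred vertical edge: every row ramps from 0 up to 4.
blurred = [[0, 0, 1, 3, 4, 4] for _ in range(3)]
sharpened = sharpen(blurred)

if __name__ == "__main__":
    print(blurred[1])
    print(sharpened[1])
```

In the middle row the largest step between neighbouring pixels grows from 2 to 4, at the cost of a small undershoot and overshoot on either side of the edge.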

Weszka, Dyer and Rosenfeld(1976) note that convolving
a step edge with a Laplacian operator gives rise to a pulse
pair: a negative pulse at the transition from the lower
plateau to the edge, and a positive pulse at the transition
from the edge to the upper plateau (see also Horn 1974,
Marr and Hildreth 1980). They suggested that the image
intensities at the locations of the positive and negative
pulses could be used to set thresholds to use in segmenting
the image into regions.

Several authors have noted the relative insensitivity of
human perception to small intensity gradients (Herskovits
and Binford 1970, Marr 1976, Marr and Hildreth 1980,
McCann et. al. 1974). They have noted that the effect
can be explained by assuming that the vision system uses
operators approximating second derivatives. This so-called
lateral inhibition effect seems to be performed by center surround
operators in the retina (see for example Richter and
Ullman 1980). The Laplacian is a rotationally symmetric
second differential operator, and an attractive candidate to
perform lateral inhibition.

The use of the Laplacian for edge detection was proposed
by Horn(1974) in a study of the determination of
lightness. Following Land and McCann(1971), Horn restricted
attention to images of planes colored with patches
of uniform reflectance or color. Within a patch, grey
level variations are due to small variations in illumination,
and they are smooth compared to the abrupt changes between
patches. The conventional approach to detecting
significant changes in intensity had been to note that the
gradient of the image is small within a region, but is infinite
across a reflectance boundary between regions. For a particular
image tessellation and quantization of grey levels,
the gradient is always finite. It is usually much larger,
however, at a reflectance boundary than it is within a
region. Horn(1974) rejected using the gradient since "the
first partial derivatives are directional and thus unsuitable
since they will for example completely eliminate evidence
of edges running in a direction parallel to their direction of
differentiation." The Laplacian is the lowest order linear
combination of derivatives that is rotationally symmetric.
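The contrast between a directional first derivative and the rotationally symmetric Laplacian can be checked numerically. The sketch below is an illustration, not part of the original argument: the test function `f`, the evaluation point, and the finite difference step `h` are all arbitrary choices made for this example.

```python
import math


def f(x, y):
    """Arbitrary cubic test function; its Laplacian is 10*x - 2."""
    return x ** 3 + 2 * x * y ** 2 - y ** 2


def first_directional(func, x, y, theta, h=1e-3):
    """Central-difference first derivative along the direction at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    return (func(x + h * ux, y + h * uy) - func(x - h * ux, y - h * uy)) / (2 * h)


def second_directional(func, x, y, theta, h=1e-3):
    """Central-difference second derivative along the direction at angle theta."""
    ux, uy = math.cos(theta), math.sin(theta)
    return (func(x + h * ux, y + h * uy) - 2 * func(x, y)
            + func(x - h * ux, y - h * uy)) / h ** 2


def laplacian(func, x, y, theta):
    """Sum of second derivatives along two perpendicular axes rotated by theta."""
    return (second_directional(func, x, y, theta)
            + second_directional(func, x, y, theta + math.pi / 2))


if __name__ == "__main__":
    for theta in (0.0, 0.7, 1.3):
        print(theta, first_directional(f, 1.0, 2.0, theta), laplacian(f, 1.0, 2.0, theta))
```

At the point (1, 2) the Laplacian evaluates to 8 (up to rounding) whatever the rotation of the axes, while the first derivative changes with the chosen direction.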

A reflectance boundary can be detected by the paired positive
and negative peaks on either side of the boundary, and
localized by noting the position where the Laplacian crosses
zero between the peaks².
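Detection by the pulse pair and localization by the zero crossing can be seen on a one-dimensional intensity profile. This is a minimal sketch under stated assumptions: the second difference stands in for the Laplacian, the profile is made up, and the sign of each pulse depends on the sign convention chosen for the operator, so only the change of sign is used.

```python
def second_difference(signal):
    """Discrete second derivative at the interior samples of a 1D signal."""
    return [signal[i - 1] - 2 * signal[i] + signal[i + 1]
            for i in range(1, len(signal) - 1)]


def edge_positions(signal):
    """Positions (in signal coordinates) where the second difference changes sign.

    d2[k] belongs to signal index k + 1, so a sign change between d2[k] and
    d2[k + 1] lies midway between signal indices k + 1 and k + 2.
    Exact zeros between the pulses are not handled in this sketch.
    """
    d2 = second_difference(signal)
    return [k + 1.5 for k in range(len(d2) - 1) if d2[k] * d2[k + 1] < 0]


# A slightly blurred step from a lower plateau (0) to an upper plateau (4).
profile = [0, 0, 0, 0, 1, 3, 4, 4, 4, 4]

if __name__ == "__main__":
    print(second_difference(profile))
    print(edge_positions(profile))
```

The two pulses of opposite sign flank the edge, and the sign change between them falls midway between samples 4 and 5, where the profile passes through the mean of the two plateaus.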


¹ A proof of this is given in Section 3 below.

* See Binford(1981) for more on the distinction between detection
and localization of an intensity change.

Marr and Hildreth(1980) have proposed that edges are
detected in the human visual system by an operator that
approximates $\Delta G$, where $\Delta$ is the Laplacian, and $G$ is a
rotationally symmetric Gaussian. We shall show in the next
section that the application of a rotationally symmetric
operator, such as the Laplacian, to a rotationally symmetric
function, such as the Gaussian, is itself rotationally symmetric.
It follows that the Marr-Hildreth operator is rotationally
symmetric. Marr and Hildreth note that intensity
changes occur at a number of scales and are often superimposed.
They suggest that an image should be smoothed
by a number of bandpass filters to isolate the changes at a
particular range of scales. The Gaussian is chosen as the
filter to optimize localization of changes in both the spatial
and frequency domains.

We noted above that the Gaussian and the Laplacian
have figured prominently in early visual processing. The
Gaussian has mostly been used to approximate the point
spread function corresponding to the blurring of a point
source. Marr and Hildreth deliberately introduce Gaussian
blurring. They further note that $\Delta G$ can be approximated
by a difference of Gaussians, $G_1 - G_2$. Nishihara and
Larson(1981) note that the difference of Gaussians is to be
preferred on grounds of efficiency. Macleod(1972) proposes
an edge detection operator that is the difference of two
Gaussians. However, no analysis of its performance is given,
and no indication is given that the operator approximates
a low-pass filtered second derivative.
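The approximation of $\Delta G$ by a difference of Gaussians can be checked numerically. The sketch below is illustrative only: the grid size, the value of sigma, and the 5% spread between the two Gaussian widths are assumptions chosen for the example, not values taken from the papers cited above. It relies on the identity $\partial G / \partial \sigma = \sigma \Delta G$ for a normalized two-dimensional Gaussian, so that a narrow Gaussian minus a wide one is approximately proportional to $-\Delta G$.

```python
import numpy as np

def gaussian(x, y, sigma):
    """Normalized, rotationally symmetric 2-D Gaussian."""
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

def laplacian_of_gaussian(x, y, sigma):
    """Closed form of (d^2/dx^2 + d^2/dy^2) applied to the Gaussian above."""
    r2 = x**2 + y**2
    return (r2 - 2 * sigma**2) / sigma**4 * gaussian(x, y, sigma)

def difference_of_gaussians(x, y, sigma1, sigma2):
    """G_1 - G_2 with widths sigma1 and sigma2."""
    return gaussian(x, y, sigma1) - gaussian(x, y, sigma2)

def normalized_correlation(a, b):
    """Cosine similarity of two sampled kernels."""
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))

# Hypothetical sampling grid and widths, chosen only for illustration.
coords = np.linspace(-8.0, 8.0, 161)
x, y = np.meshgrid(coords, coords)
sigma = 1.5
dog = difference_of_gaussians(x, y, sigma / 1.05, sigma * 1.05)
log = laplacian_of_gaussian(x, y, sigma)

# Narrow minus wide has the opposite sign to the Laplacian of the Gaussian.
similarity = normalized_correlation(dog, -log)
```

Both kernels depend on x and y only through $x^2 + y^2$, so each is unchanged by a quarter-turn of the sampling grid.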

Regarding the use of the Laplacian, Marr and Hildreth
do not seem to make isotropy an explicit constraint on
edge detection. Instead, Hildreth(1980, page 13) notes that
"a number of practical considerations, which will be illuminated
in the discussion of the implementation, suggested
that the ... operators not be directional". Suppose
instead that directional operators are used. The simplest
algorithm for edge detection has two stages. First, the image
is convolved with the directional operators in "sufficiently
many" directions. Second, the outputs are combined to
determine the orientation and extent of intensity changes.
Regarding the first stage, both Marr and Hildreth(1980,
page 193) and Hildreth(1980, page 40) claim that the cost of
convolving the image with a "sufficient" number of operators
is excessive. They show that a single rotationally symmetric
operator (the Laplacian) gives precisely the same results if
a condition called "linear variation" holds. Regarding the
second stage, Hildreth(1980, page 36) observes that edges
in a direction close to that of the mask are elongated in
the direction of the mask. She also notes that operators at
several orientations give significant responses to any given
edge, and that combining the responses is non-trivial.

There are two essentially different issues here that need
to be clearly separated. Intensity changes first have to be
detected and then localized as a set of "feature points"
marking the position of the change in the image, and
characteristics of the corresponding edge. The detection of
feature points is inherently isotropic, as Horn(1974) noted.
The feature points have then to be combined to produce
descriptions of edge segments. Edge segments are clearly
directional; indeed a central problem concerns the determination
of the direction of an edge in an image. The computation
of rich descriptions of edge segments is, as Hildreth
notes, not at all easy. Marr's(1976) original Primal Sketch
work was almost entirely concerned with it. Binford(1981)
discusses the application of directional operators to compute
the directionality of an edge.

The Gaussian and Laplacian are not the only rotationally
symmetric operators that have been proposed in
computer vision. Prewitt(1970, p. 107) observes that
"derivatives of all orders can be used to form isotropic nonlinear
differential operators, provided that derivatives of
odd order appear only in even functions. The simplest of
these ... is the squared gradient", namely $\nabla \cdot \nabla$, where $\nabla$
is the column vector $[\partial/\partial x \;\; \partial/\partial y]^T$.

Earlier in the same article, Prewitt(1970, p. 85) suggests
that "the Hankel transformation enters naturally in
the analysis of systems with isotropic point spread functions
and greatly facilitates restoration." The suggestion does
not appear to have been investigated in computer vision.

We noted earlier that an important aspect of modelling
perception is the isolation of constraints which capture
some facet of smoothness. Horn and Schunck(1981) consider
the determination of optical flow fields and note that
"if every point of the brightness pattern can move independently,
there is little hope of recovering the velocities". One
way to express the additional constraint of smoothness is
to minimize the integral of the performance index

$$S(u, v) = (u_x^2 + u_y^2) + (v_x^2 + v_y^2),$$

where $u$ and $v$ are the $x$ and $y$ components of the optical
flow, and subscripts denote partial differentiation. We
shall show in the next section that this operator is rotationally
symmetric. In many simple situations the smoothness
constraint is violated significantly only at occluding boundaries.

We conclude this review of the use of rotationally symmetric
operators in vision with Grimson's(1981) work on
surface interpolation. As it will be the focus of Section
5, our remarks will be brief. The Marr-Poggio theory
of human stereo vision yields the disparity (scaled depth)
at matched edge points that are computed by the Marr-Hildreth
approach described above. The disparity map is
as sparse as the set of matched edge points, whereas human
perception is of smooth surfaces passing through the given
disparity points. Grimson(1981) interpolates a smooth surface
from the given set of edge points by a local parallel
algorithm that applies a rotationally symmetric operator
to minimize the quadratic variation introduced above.

3.  Conditions for rotational symmetry

A function $f: \mathbb{R}^2 \to \mathbb{R}$ is rotationally symmetric if
its polar form is only dependent on radial distance $r = (x^2 + y^2)^{1/2}$
and not on direction $\phi = \tan^{-1}(y/x)$. Clearly,
a function is rotationally symmetric if and only if it can
be represented as a function of $(x^2 + y^2)^{1/2}$. An alternative
definition can be given that is often more convenient
for functions, and that can be generalized to operators. A
function is rotationally symmetric if and only if it yields
the same value under an arbitrary rotation of coordinates.

An anticlockwise rotation from one set of image coordinates
$(x, y)$ to another $(X, Y)$ is effected by a rotation
matrix:

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = R \begin{bmatrix} x \\ y \end{bmatrix}.$$

For convenience, we shall denote $\cos\phi$ by $c$ and $\sin\phi$
by $s$. To simplify notation, we shall not make explicit
the dependence of the rotation matrix $R$ on the angle $\phi$.

A function $f$ is rotationally symmetric if and only if the
untransformed version $f(x, y)$ gives the same value as the
transformed version $f(X, Y)$. We shall occasionally find it
useful to borrow the mathematical shorthand that equates
a function $f(X, Y)$ with a function of a single vector argument
$f(R[x \;\; y]^T)$.

Example 1. The function $f_1(x, y) = (x^2 + y^2)$ is rotationally
symmetric:

$$f_1(X, Y) = ((xc + ys)^2 + (yc - xs)^2) = (x^2 + y^2) = f_1(x, y).$$

Example 2. The function $f_2(x, y) = xy$ is not rotationally
symmetric:

$$f_2(X, Y) = (xc + ys)(yc - xs) = xy \cos 2\phi + \frac{y^2 - x^2}{2} \sin 2\phi,$$

and so $f_2(X, Y) = f_2(x, y)$ only when $\phi = 0$ or $\phi = \pi$.
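Examples 1 and 2 can be checked numerically by sampling points and rotation angles. The following is a minimal sketch; the helper names `rotate` and `is_rotationally_symmetric`, the number of sample points, and the number of angles are assumptions made for illustration. A sampled test of this kind can refute symmetry but can only suggest it.

```python
import numpy as np

def rotate(x, y, phi):
    """Apply the rotation matrix R = [[c, s], [-s, c]] to the coordinates (x, y)."""
    c, s = np.cos(phi), np.sin(phi)
    return c * x + s * y, -s * x + c * y

def f1(x, y):
    return x**2 + y**2      # Example 1

def f2(x, y):
    return x * y            # Example 2

def is_rotationally_symmetric(f, n_points=200, n_angles=64, seed=0):
    """Compare f(x, y) with f(X, Y) at random points for a range of angles."""
    rng = np.random.default_rng(seed)
    x, y = rng.normal(size=(2, n_points))
    for phi in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        X, Y = rotate(x, y, phi)
        if not np.allclose(f(X, Y), f(x, y)):
            return False
    return True
```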

We can extend the definition of rotational symmetry
to operators

$$O: (\mathbb{R}^2 \to \mathbb{R}) \to (\mathbb{R}^2 \to \mathbb{R}).$$

An operator $O$ is rotationally symmetric if $O(f)$ is a
rotationally symmetric function, for all functions $f: \mathbb{R}^2 \to \mathbb{R}$.

Example 3. The function produced by the operator $O_1$,
defined by

$O_1(f)(x, y) =$

is rotationally symmetric if and only if $f$ is. In general then,
the operator $O_1$ is not rotationally symmetric. However,
the Gaussian is rotationally symmetric, as it combines examples
1 and 3.

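Most of the operators of interest in computer vision
are combinations of the first and second directional derivatives
$\partial/\partial x$, $\partial/\partial y$, $\partial^2/\partial x^2$, $\partial^2/\partial x \partial y$, and $\partial^2/\partial y^2$. We need to
determine the effect of a coordinate rotation on these directional
derivatives. By the chain rule,

$$\frac{\partial}{\partial x} = \frac{\partial X}{\partial x}\frac{\partial}{\partial X} + \frac{\partial Y}{\partial x}\frac{\partial}{\partial Y} = c\frac{\partial}{\partial X} - s\frac{\partial}{\partial Y}.$$

Similarly,

$$\frac{\partial}{\partial y} = s\frac{\partial}{\partial X} + c\frac{\partial}{\partial Y}.$$

It follows that

$$\begin{bmatrix} \partial/\partial x \\ \partial/\partial y \end{bmatrix} = R^T \begin{bmatrix} \partial/\partial X \\ \partial/\partial Y \end{bmatrix},$$

where $T$ denotes matrix transpose. Since $R$ is a rotation
matrix, its transpose equals its inverse, so

$$\begin{bmatrix} \partial/\partial X \\ \partial/\partial Y \end{bmatrix} = R \begin{bmatrix} \partial/\partial x \\ \partial/\partial y \end{bmatrix}. \qquad (1)$$

Operators in general, and differential operators in particular,
depend upon the choice of coordinate frame. To
make the dependence of the differential operator on the
choice of coordinate frame explicit, we introduce the notation
$\nabla_{(x,y)}$. With this notation, equation (1) becomes

$$\nabla_{(X,Y)} = R \nabla_{(x,y)}, \qquad (2)$$

where $\nabla_{(x,y)}$ is the column vector $[\partial/\partial x \;\; \partial/\partial y]^T$.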

Proposition 1. Linear combinations of $\partial/\partial x$ and $\partial/\partial y$ are
not rotationally symmetric.

Proof. Any linear form in the first directional derivatives
has the form

$$[\lambda \;\; \mu] \nabla_{(x,y)}.$$

The condition for rotational symmetry is

$$[\lambda \;\; \mu] \nabla_{(X,Y)} = [\lambda \;\; \mu] \nabla_{(x,y)}.$$

By equation (2),

$$[\lambda \;\; \mu] \nabla_{(X,Y)} = [\lambda \;\; \mu] R \nabla_{(x,y)},$$

and so the linear differential operator is rotationally
symmetric if and only if

$$[\lambda \;\; \mu] = [\lambda \;\; \mu] R,$$

so that $[\lambda \;\; \mu]$ is an eigenvector of $R$. The eigenvalues of $R$
are $c + is$ and $c - is$. So there are no real eigenvectors unless
$\phi$ is a multiple of $\pi$. Since the condition is not satisfied for
all $\phi$, no linear combination is rotationally symmetric. ∎
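The eigenvalue argument in the proof can be illustrated numerically. This is a small sketch, assuming the rotation matrix convention $R = [[c, s], [-s, c]]$ used in this section; the function names and the tolerance are hypothetical.

```python
import numpy as np

def rotation(phi):
    """Rotation matrix R = [[c, s], [-s, c]]."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, s], [-s, c]])

def has_real_eigenvalues(phi, tol=1e-12):
    """True when the eigenvalues c + is and c - is of R are real, i.e. s is (numerically) zero."""
    eigenvalues = np.linalg.eigvals(rotation(phi))
    return bool(np.all(np.abs(eigenvalues.imag) < tol))

def eigenvalues_match_formula(phi):
    """Compare the computed eigenvalues with c + is and c - is."""
    c, s = np.cos(phi), np.sin(phi)
    computed = np.sort_complex(np.linalg.eigvals(rotation(phi)).astype(complex))
    expected = np.sort_complex(np.array([c + 1j * s, c - 1j * s]))
    return bool(np.allclose(computed, expected))
```

For a generic angle such as 0.3 radians the eigenvalues are a complex conjugate pair, so no real row vector $[\lambda \;\; \mu]$ is left fixed by $R$.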
The  same  style  of  analysis  can  be  applied  to  other 
combinations  of  first  derivatives  such  as  the  operator 


ChU)  = 


A  fy 
A  +  Jy 


It  is  easy  to  show  that  O-^x.Y )  's  not  equal  to  02(x,y),  for 
example  when  <t>  —  J- 

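In section 2, we referred to an operator proposed by
Prewitt(1970), namely

$$\nabla_{(x,y)} \cdot \nabla_{(x,y)},$$

that is, the vector dot product

$$\nabla_{(x,y)}^T \nabla_{(x,y)}.$$

More generally, we often consider quadratic differential
expressions such as

$$\nabla_{(x,y)}^T \begin{bmatrix} \lambda & \mu \\ \nu & \xi \end{bmatrix} \nabla_{(x,y)}.$$

Such an expression is called a quadratic form if the matrix
is symmetric, that is $\mu = \nu$. By equation (1),

$$\nabla_{(X,Y)}^T M \nabla_{(X,Y)} = \nabla_{(x,y)}^T R^T M R \nabla_{(x,y)},$$

so that

$$\nabla_{(X,Y)}^T M \nabla_{(X,Y)} = \nabla_{(x,y)}^T M \nabla_{(x,y)}$$

if and only if

$$R^T M R = M,$$

where $R$ is an arbitrary rotation matrix, and

$$M = \begin{bmatrix} \lambda & \mu \\ \nu & \xi \end{bmatrix}.$$

Since the transpose $R^T$ of a rotation matrix $R$ is the
inverse of $R$, a quadratic form is rotationally symmetric if
and only if the corresponding matrix $M$ commutes with all
rotation matrices. We will refer to matrices $M$ having this
property as being rotationally symmetric.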


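Lemma 1. A 2 by 2 matrix is rotationally symmetric
if and only if it has the form

$$\begin{bmatrix} \lambda & \mu \\ -\mu & \lambda \end{bmatrix}.$$

Proof. We require $RM = MR$ for all rotation matrices
$R$, that is

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} \lambda & \mu \\ \nu & \xi \end{bmatrix} = \begin{bmatrix} \lambda & \mu \\ \nu & \xi \end{bmatrix} \begin{bmatrix} c & s \\ -s & c \end{bmatrix}.$$

Expanding, and equating terms, this holds if and only if

$$\mu + \nu = 0, \qquad \lambda = \xi.$$

Alternatively, only the operations of scaling by a constant
$k$ and multiplication by a rotation matrix $R'$ commute
with all rotation matrices in two dimensions. So $M = kR'$
for some scale factor $k$ and some rotation matrix $R'$. ∎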
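Lemma 1 can be spot-checked numerically. The sketch below samples a handful of rotation angles; the helper name `commutes_with_rotations` and the sampled angle range are assumptions for illustration, and a sampled check is of course not a proof.

```python
import numpy as np

def rotation(phi):
    """Rotation matrix R = [[c, s], [-s, c]]."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, s], [-s, c]])

def commutes_with_rotations(M, n_angles=16):
    """Check RM = MR for a sample of rotation angles."""
    M = np.asarray(M, dtype=float)
    return all(
        np.allclose(rotation(phi) @ M, M @ rotation(phi))
        for phi in np.linspace(0.1, 3.0, n_angles)
    )

# A matrix of the form [[lambda, mu], [-mu, lambda]] from Lemma 1.
lemma_form = np.array([[2.0, 3.0], [-3.0, 2.0]])
# Two matrices that violate lambda = xi or mu + nu = 0.
unequal_diagonal = np.array([[1.0, 0.0], [0.0, 2.0]])
symmetric_off_diagonal = np.array([[0.0, 1.0], [1.0, 0.0]])
```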

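Proposition 2. Up to scaling, the only rotationally
symmetric quadratic form in $\partial/\partial x$ and $\partial/\partial y$ is $\nabla_{(x,y)} \cdot \nabla_{(x,y)}$.

Proof. A quadratic form in $\partial/\partial x$ and $\partial/\partial y$ has the form

$$\nabla_{(x,y)}^T \begin{bmatrix} \lambda & \mu \\ \mu & \xi \end{bmatrix} \nabla_{(x,y)}. \qquad (3)$$

To be rotationally symmetric, as well as symmetric (so
that it is a quadratic form), Lemma 1 implies that

$$\lambda = \xi, \qquad \mu = 0.$$

It follows that the matrix in equation (3) is $\lambda I_2$. ∎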

The operator $f_x^2 + f_y^2$ is commonly used as a measure
of the contrast across an intensity change. Notice that
other measures of the contrast, such as $(f_x + f_y)^2$, $(f_x - f_y)^2$,
or $\|f_x\| + \|f_y\|$ are not rotationally symmetric, and
therefore respond differently to edges in different directions
(see Rosenfeld and Kak 1976, p279).
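The directional bias of the alternative contrast measures is easy to see on an idealized intensity ramp whose gradient has fixed magnitude and varying direction. The following sketch is illustrative; the function names, the dictionary keys, and the number of sampled directions are assumptions, not part of the original analysis.

```python
import numpy as np

def contrast_measures(theta, magnitude=1.0):
    """Contrast measures for a ramp whose gradient has the given magnitude and direction theta."""
    fx = magnitude * np.cos(theta)
    fy = magnitude * np.sin(theta)
    return {
        "squared_gradient": fx**2 + fy**2,
        "sum_squared": (fx + fy) ** 2,
        "difference_squared": (fx - fy) ** 2,
        "abs_sum": abs(fx) + abs(fy),
    }

def spread_over_directions(name, n_angles=180):
    """Largest minus smallest response of one measure as the edge direction varies."""
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    values = [contrast_measures(theta)[name] for theta in thetas]
    return max(values) - min(values)
```

Only the squared gradient gives the same response in every direction; the other three vary with the orientation of the edge.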

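We now consider linear and quadratic forms in
$\partial^2/\partial x^2$, $\partial^2/\partial x \partial y$, $\partial^2/\partial y \partial x$, and $\partial^2/\partial y^2$. It is convenient to not assume
$\partial^2/\partial x \partial y = \partial^2/\partial y \partial x$ for the developments that follow.

The first task is to find a matrix $R^*$ so that

$$\begin{bmatrix} \partial^2/\partial x^2 \\ \partial^2/\partial x \partial y \\ \partial^2/\partial y \partial x \\ \partial^2/\partial y^2 \end{bmatrix} = R^* \begin{bmatrix} \partial^2/\partial X^2 \\ \partial^2/\partial X \partial Y \\ \partial^2/\partial Y \partial X \\ \partial^2/\partial Y^2 \end{bmatrix}. \qquad (4)$$

The $(i, j)$ element of the matrix $R$¹ will be denoted
by $r_{ij}$. Applying the chain rule as before, but this time to
relate the second derivatives in $(X, Y)$ to those in $(x, y)$, we
find that the four by four matrix $R^*$ can be written in the
form

$$R^* = \begin{bmatrix} r_{11} R^T & r_{21} R^T \\ r_{12} R^T & r_{22} R^T \end{bmatrix}. \qquad (5)$$

¹ Recall the definition of the matrix R from equation (0).

Definition 1. (Ben-Israel and Greville 1974, page 41) Let
$A = [a_{ij}]$ and $B = [b_{ij}]$ be m by m and n by n matrices
respectively. The mn by mn matrix $A \otimes B$, called the
Kronecker product of $A$ and $B$, is defined by multiplying
each element $a_{ij}$ of $A$ by the matrix $B$, to form the block
matrix

$$\begin{bmatrix} a_{11} B & a_{12} B & \cdots & a_{1m} B \\ a_{21} B & a_{22} B & \cdots & a_{2m} B \\ \vdots & \vdots & & \vdots \\ a_{m1} B & a_{m2} B & \cdots & a_{mm} B \end{bmatrix}. \qquad (6)$$

With this notation,

$$R^* = R^T \otimes R^T,$$

so that

$$R^* = \begin{bmatrix} c^2 & -sc & -sc & s^2 \\ sc & c^2 & -s^2 & -sc \\ sc & -s^2 & c^2 & -sc \\ s^2 & sc & sc & c^2 \end{bmatrix}. \qquad (7)$$

Note that the elements of $A \otimes B$ are naturally indexed
by 4-tuples:

$$(A \otimes B)_{ijkl} = a_{ij} b_{kl}.$$

We state without proof a number of simple properties
of the $\otimes$ operation. They are essentially straightforward
consequences of the properties of ordinary multiplication.

Proposition 3

(i) $(A \otimes B)^T = A^T \otimes B^T$
(ii) $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$
(iii) $(A \otimes B) \otimes C = A \otimes (B \otimes C)$

For the remainder of the paper, we restrict attention
to the application of $\otimes$ to $R$ and its transpose.

Proposition 4. The rotationally symmetric linear combinations
of $\partial^2/\partial x^2$, $\partial^2/\partial x \partial y$, $\partial^2/\partial y \partial x$, and $\partial^2/\partial y^2$ are linear combinations
of the Laplacian $\Delta = \partial^2/\partial x^2 + \partial^2/\partial y^2$, and the smoothness
measure $\partial^2/\partial x \partial y - \partial^2/\partial y \partial x$.

Proof. Let the linear combination be

$$[\lambda \;\; \mu \;\; \nu \;\; \xi] \begin{bmatrix} \partial^2/\partial x^2 \\ \partial^2/\partial x \partial y \\ \partial^2/\partial y \partial x \\ \partial^2/\partial y^2 \end{bmatrix}.$$

Following the proof of Proposition 1, the condition for
rotational symmetry is

$$[\lambda \;\; \mu \;\; \nu \;\; \xi] \, R^T \otimes R^T = [\lambda \;\; \mu \;\; \nu \;\; \xi],$$

for all rotation matrices $R$ and the corresponding rotation
angle $\phi$. Expanding $R^T \otimes R^T$ by equation (7), we find

$$[\lambda \;\; \mu \;\; \nu \;\; \xi] \begin{bmatrix} c^2 & -sc & -sc & s^2 \\ sc & c^2 & -s^2 & -sc \\ sc & -s^2 & c^2 & -sc \\ s^2 & sc & sc & c^2 \end{bmatrix} = [\lambda \;\; \mu \;\; \nu \;\; \xi],$$

so that

$$[\lambda \;\; \mu \;\; \nu \;\; \xi] \begin{bmatrix} -s^2 & -sc & -sc & s^2 \\ sc & -s^2 & -s^2 & -sc \\ sc & -s^2 & -s^2 & -sc \\ s^2 & sc & sc & -s^2 \end{bmatrix} = [0 \;\; 0 \;\; 0 \;\; 0].$$

It follows that

$$[\lambda - \xi \;\;\; \mu + \nu \;\;\; 0 \;\;\; 0] \begin{bmatrix} -2s^2 & -2sc & -2sc & 2s^2 \\ 2sc & -2s^2 & -2s^2 & -2sc \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} = [0 \;\; 0 \;\; 0 \;\; 0].$$

The determinant of the upper left 2 by 2 submatrix is

$$(4s^4 + 4s^2c^2) = 4s^2.$$

Since this is not zero for all angles $\phi$, $\lambda - \xi$ and $\mu + \nu$ are
both zero. A basis for the infinite set of linear combinations
satisfying these conditions is provided by setting $\lambda$ and $\mu$
equal to one, which proves the Proposition. ∎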
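Equation (7) and the conclusion of Proposition 4 can be checked with a numerical Kronecker product. The sketch below uses NumPy's `np.kron`; the angle 0.7 and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

phi = 0.7  # arbitrary test angle (an assumption for the example)
c, s = np.cos(phi), np.sin(phi)
R = np.array([[c, s], [-s, c]])

# Kronecker product of the transposed rotation matrix with itself, as in equation (7).
K = np.kron(R.T, R.T)
expected = np.array([
    [c * c, -s * c, -s * c, s * s],
    [s * c, c * c, -s * s, -s * c],
    [s * c, -s * s, c * c, -s * c],
    [s * s, s * c, s * c, c * c],
])

# Coefficient rows [lambda, mu, nu, xi] over (xx, xy, yx, yy).
laplacian = np.array([1.0, 0.0, 0.0, 1.0])        # d2/dx2 + d2/dy2
antisymmetric = np.array([0.0, 1.0, -1.0, 0.0])   # d2/dxdy - d2/dydx
single_term = np.array([1.0, 0.0, 0.0, 0.0])      # d2/dx2 alone

# Dimension of the space of rows v with v K = v, for this particular angle.
fixed_dimension = 4 - np.linalg.matrix_rank(K - np.eye(4), tol=1e-9)
```

For this angle the fixed space is two dimensional, spanned by the two coefficient rows named in Proposition 4.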

Before turning to quadratic forms, analogous to Proposition
2, we define a projection operator on $R^T \otimes R^T$ that
makes explicit the assumption $f_{xy} = f_{yx}$.

Definition 2. Let $D = (d_{ij})$ be a 4 by 4 matrix. The
projection of $D$ is the 3 by 3 matrix $D'$:

$$D' = \begin{bmatrix} d_{11} & (d_{12} + d_{13}) & d_{14} \\ (d_{21} + d_{31}) & (d_{22} + d_{23} + d_{32} + d_{33}) & (d_{24} + d_{34}) \\ d_{41} & (d_{42} + d_{43}) & d_{44} \end{bmatrix}.$$

That is, the second and third columns as well as the
second and third rows are combined by addition.

Proposition 5.

$$[a \;\; b \;\; b \;\; c] \, D \, [a \;\; b \;\; b \;\; c]^T$$

is equivalent to

$$[a \;\; b \;\; c] \, D' \, [a \;\; b \;\; c]^T,$$

where $D'$ is the projection of $D$.

The proof is by equating terms, and is omitted. We
now give the main result of this section.
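Proposition 5 can be verified on random data. In the sketch below the function name `project` and the random test values are assumptions introduced for illustration; the function implements the projection of Definition 2 by adding the middle rows and then the middle columns.

```python
import numpy as np

def project(D):
    """Projection of a 4 by 4 matrix to 3 by 3: add rows 2 and 3, then columns 2 and 3."""
    D = np.asarray(D, dtype=float)
    rows = np.vstack([D[0], D[1] + D[2], D[3]])  # 3 by 4 after merging rows
    return np.column_stack([rows[:, 0], rows[:, 1] + rows[:, 2], rows[:, 3]])

rng = np.random.default_rng(1)
D = rng.normal(size=(4, 4))
a, b, c = rng.normal(size=3)

four_vector = np.array([a, b, b, c])
three_vector = np.array([a, b, c])

full_form = four_vector @ D @ four_vector
projected_form = three_vector @ project(D) @ three_vector
```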

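Proposition 6. The rotationally symmetric quadratic
forms in $\partial^2/\partial x^2$, $\partial^2/\partial x \partial y$, $\partial^2/\partial y \partial x$, and $\partial^2/\partial y^2$ form a vector space. If
$\partial^2/\partial x \partial y = \partial^2/\partial y \partial x$, matrices associated with the rotationally
symmetric quadratic forms project to 3 by 3 matrices of
the form

$$\begin{bmatrix} \alpha + \beta & 0 & \beta \\ 0 & 2\alpha & 0 \\ \beta & 0 & \alpha + \beta \end{bmatrix}.$$

It follows that the rotationally symmetric quadratic
forms that satisfy $\partial^2/\partial x \partial y = \partial^2/\partial y \partial x$ form a vector space that
has the quadratic variation, $f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2$,
and the square Laplacian, $(f_{xx} + f_{yy})^2$, as a basis.

Proof. Since the matrix in a quadratic form is defined
to be symmetric, a quadratic form in $\partial^2/\partial x^2$, $\partial^2/\partial x \partial y$, $\partial^2/\partial y \partial x$,
$\partial^2/\partial y^2$ can be written

$$d^T \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} d, \qquad d = \left[ \frac{\partial^2}{\partial x^2} \;\; \frac{\partial^2}{\partial x \partial y} \;\; \frac{\partial^2}{\partial y \partial x} \;\; \frac{\partial^2}{\partial y^2} \right]^T,$$

where $A$ and $C$ are symmetric 2 by 2 matrices, and $B$ is 2 by
2. As usual, the quadratic form is rotationally symmetric
if and only if

$$R^T \otimes R^T \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} R^T \otimes R^T,$$

where $R$ is an arbitrary rotation matrix. It follows that

$$\begin{bmatrix} cR^T & sR^T \\ -sR^T & cR^T \end{bmatrix} \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} cR^T & sR^T \\ -sR^T & cR^T \end{bmatrix},$$

that is,

$$\begin{bmatrix} cR^TA + sR^TB^T & cR^TB + sR^TC \\ -sR^TA + cR^TB^T & -sR^TB + cR^TC \end{bmatrix} = \begin{bmatrix} cAR^T - sBR^T & sAR^T + cBR^T \\ cB^TR^T - sCR^T & sB^TR^T + cCR^T \end{bmatrix}.$$

Equating submatrices, we find that for all rotation
angles $\phi$

$$c(R^TA - AR^T) + s(R^TB^T + BR^T) = 0, \qquad (8)$$

$$c(R^TC - CR^T) - s(B^TR^T + R^TB) = 0, \qquad (9)$$

$$s(CR^T - R^TA) + c(R^TB^T - B^TR^T) = 0, \qquad (10)$$

$$s(R^TC - AR^T) + c(R^TB - BR^T) = 0. \qquad (11)$$

Consider equation (10) or (11) when $\phi = \pi/2$. Equating
terms, we find that

$$a_{11} = c_{22}, \qquad a_{22} = c_{11}, \qquad a_{12} = -c_{21}, \qquad a_{21} = -c_{12}. \qquad (12)$$

Similarly, equation (8) or (9) when $\phi = \pi/2$ yields

$$b_{11} + b_{22} = 0. \qquad (13)$$

Expanding equation (8) for general $\phi$ yields

$$b_{11} + a_{12} = 0, \qquad (14)$$

$$b_{22} - a_{12} = 0, \qquad (15)$$

$$b_{21} + b_{12} + a_{22} - a_{11} = 0. \qquad (16)$$

Combining equations (12) through (16) we find that in order
to be rotationally symmetric, the matrix

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$$

has the form

$$\begin{bmatrix} \alpha + \beta & \gamma & -\gamma & \beta \\ \gamma & \alpha - \delta & \delta & \gamma \\ -\gamma & \delta & \alpha - \delta & -\gamma \\ \beta & \gamma & -\gamma & \alpha + \beta \end{bmatrix}.$$

A matrix of this form projects to

$$\begin{bmatrix} \alpha + \beta & 0 & \beta \\ 0 & 2\alpha & 0 \\ \beta & 0 & \alpha + \beta \end{bmatrix},$$

where $\alpha = a_{11} - b_{12}$ and $\beta = b_{12}$. It is easy to show that
linear combinations of matrices of this form are of the same
form, so that the rotationally symmetric quadratic forms
constitute a vector space. Clearly, the quadratic variation
and the square Laplacian, corresponding to the cases $\alpha = 1, \beta = 0$
and $\alpha = 0, \beta = 1$ respectively, form a basis. ∎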
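The invariance of the two basis forms can be illustrated directly on a Hessian matrix. The sketch below assumes mixed partials are equal, so the Hessian $H$ is symmetric, and that under the coordinate rotation of equation (1) the Hessian transforms as $R H R^T$; the sample Hessian and angle are arbitrary values chosen for the example.

```python
import numpy as np

def rotation(phi):
    """Rotation matrix R = [[c, s], [-s, c]]."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, s], [-s, c]])

def rotated_hessian(H, phi):
    """Hessian expressed in the rotated frame (X, Y)."""
    R = rotation(phi)
    return R @ H @ R.T

def quadratic_variation(H):
    """f_xx^2 + 2 f_xy^2 + f_yy^2."""
    return H[0, 0] ** 2 + 2 * H[0, 1] ** 2 + H[1, 1] ** 2

def squared_laplacian(H):
    """(f_xx + f_yy)^2."""
    return (H[0, 0] + H[1, 1]) ** 2

# Hypothetical second derivatives at one image point: [[f_xx, f_xy], [f_xy, f_yy]].
H = np.array([[1.0, 0.5], [0.5, -2.0]])
H_rotated = rotated_hessian(H, 0.4)
```

The quadratic variation is the squared Frobenius norm of the Hessian and the squared Laplacian is its squared trace, both of which are preserved by an orthogonal change of frame, whereas a single term such as $f_{xx}^2$ is not.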

We show that the measure of smoothness of optical flow
proposed by Horn and Schunck(1981) is rotationally symmetric.
Recall from section 2 that the measure is defined
by the operator

$$S(u, v) = (u_x^2 + u_y^2) + (v_x^2 + v_y^2).$$

We extend the Kronecker product operator $\otimes$ to vectors,
and then show how to define $S(u, v)$ in terms of vector
Kronecker products.

Definition 3. (a) Let $a = [a_1 \ldots a_m]$ and $b = [b_1 \ldots b_n]$
be vectors. The Kronecker product of $a$ and $b$ is the mn
dimensional vector $[a_1 b_1 \ldots a_1 b_n \;\; a_2 b_1 \ldots a_m b_n]$.

(b) By extension, if $O = [O_1 \ldots O_m]$ is a vector of
operators and $f = [f_1 \ldots f_n]$ is a vector of functions, the
Kronecker product of $O$ and $f$ is the mn dimensional vector
of functions

$$[O_1(f_1) \ldots O_1(f_n) \ldots O_m(f_n)].$$

The components $u$ and $v$ of optical flow are functions of $x$,
$y$, and $t$. Recall that $\nabla_{(x,y)} = [\partial/\partial x \;\; \partial/\partial y]^T$. According to
definition 3,

$$\nabla_{(x,y)} \otimes [u \;\; v]^T = \left[ \frac{\partial u}{\partial x} \;\; \frac{\partial v}{\partial x} \;\; \frac{\partial u}{\partial y} \;\; \frac{\partial v}{\partial y} \right]^T,$$

so that

$$S(u, v) = (\nabla_{(x,y)} \otimes [u \;\; v]^T) \cdot (\nabla_{(x,y)} \otimes [u \;\; v]^T).$$

If the coordinate frame is rotated through $\phi$ by the matrix
$R$, the optical flow components become $R[u \;\; v]^T$. The
Horn-Schunck measure is rotationally symmetric if and
only if

$$(R \otimes R)^T (R \otimes R) = I_4,$$

where $I_4$ is the 4 by 4 identity matrix. The rotational
symmetry is a simple consequence of Proposition 3.
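The orthogonality condition on $R \otimes R$, and the resulting invariance of the smoothness measure, can be checked numerically. In this sketch the flow derivatives are random numbers standing in for $[u_x \;\; v_x \;\; u_y \;\; v_y]$ at one image point, and the angle is arbitrary; both are assumptions for illustration.

```python
import numpy as np

def rotation(phi):
    """Rotation matrix R = [[c, s], [-s, c]]."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, s], [-s, c]])

def smoothness(g):
    """S(u, v) as the dot product of the vector [u_x, v_x, u_y, v_y] with itself."""
    return float(g @ g)

R = rotation(1.1)
RR = np.kron(R, R)

rng = np.random.default_rng(2)
g = rng.normal(size=4)   # stand-in for [u_x, v_x, u_y, v_y]
g_rotated = RR @ g       # the same vector after rotating both the frame and the flow
```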

A rotationally symmetric operator has the general form

$$O_{(x,y)}(\nabla, \nabla \otimes \nabla, \nabla \otimes \nabla \otimes \nabla, \ldots),$$

and its application to a rotationally symmetric function
$f(x, y)$ has the form

$$O_{(x,y)}(f(x, y)).$$

To see that this is rotationally symmetric, we rotate the
coordinate frame to $(X, Y)$ by a matrix $R$ as before. Since
$O$ and $f$ are rotationally symmetric, all the occurrences of
$R$ (including its Kronecker square, cube, and so on) introduced
by the frame change can be deleted. It follows that
the application of a rotationally symmetric operator to a
rotationally symmetric function is itself rotationally symmetric.
In particular, the $\Delta(G)$ filters of the Marr-Hildreth
theory of edge detection are rotationally symmetric.

4.  Vision as a conservative process

The second theme of this paper is that a number of
vision modules construct the most conservative interpretation
that is consistent with the given data, and that is
subject to an appropriate set of suitably formulated constraints.
A major concern of Computer Vision has always
been the isolation of constraints that enable the interpretation
of an image. Constraints embody observations
about the way the world is, at least most of the time.
Although such observations can be as specific as cataloging
familiar figures and shapes, it has proved more fruitful
to first uncover constraints that correspond to general
observations that are widely applicable. Constraints are
used together with the data computed from the image to
construct an interpretation. The representations of the information
from the image and the constraints determine,
and are determined by, the interpretation process. For example,
early blocks world programs represented constraints
as catalogs of labellings, an approach that led naturally to
search processes for interpretation (Clowes 1971, Kanade
1981).

As Computer Vision has turned its attention to images
of the natural world, constraints have concerned the smoothness
of surfaces and movement. The relationship to boundary
value problems of physics and mathematics suggests
itself. The information computed from the image sets the
boundary conditions, and the constraints determine which
(and whether a) solution to the boundary value problem is
found. Horn(1974) solved an instance of Poisson's problem
using Green's functions to determine the lightness of an
image.

Following a different approach, Ullman(1979a) studied
the perception of apparent motion generated by two successive
frames consisting of isolated dots of equal intensity
moving independently of each other. Without constraint,
all possible pairings, or "correspondences", of dots
in the first frame with dots in the second are equally likely.
Ullman defined the "most likely" correspondence to be the
one that minimized the sum

$$\sum_{\substack{1 \le i \le n \\ 1 \le j \le m}} q_{ij} x_{ij},$$

where $n$ is the number of dots in the first frame, $m$ is the
number of dots in the second frame, and $x_{ij}$ is one if the $i$th
dot of the first frame $P_i$ is paired with the $j$th dot of the
second frame $Q_j$, else zero. The weight $q_{ij}$ is the "cost" of
pairing $P_i$ with $Q_j$, and might, for example, be related to
the image distance between the paired points. The problem
of finding the minimal correspondence is considered in
terms of integer programming. If correspondences are assumed
to be covering mappings, the following linear constraints
apply to the $x_{ij}$:

$$\forall i, \; 1 \le i \le n: \quad \sum_{1 \le j \le m} x_{ij} \ge 1,$$

and

$$\forall j, \; 1 \le j \le m: \quad \sum_{1 \le i \le n} x_{ij} \ge 1.$$

Ullman restricted the set of $Q_j$ that can be paired with $P_i$
to those whose positions were close to $P_i$. Following Arrow,
Hurwicz, and Uzawa(1958), he set up an iterative scheme.

The approach can be extended to mappings that are not
one-one, as well as to continuous motion. A major problem
with the approach is guaranteeing the convergence of the
algorithm. This is determined largely by the properties of
the costs $q_{ij}$, but this was not investigated, aside from a
comment on the empirical determination of the $q_{ij}$ (see also
Ullman 1979b).
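To make the objective concrete, here is a minimal brute-force sketch of a minimal correspondence. It is not Ullman's iterative scheme: it assumes equal numbers of dots in the two frames, restricts attention to one-to-one pairings (a special case of a covering mapping), and takes the cost $q_{ij}$ to be the Euclidean image distance. The function name and the sample dot positions are hypothetical, and the exhaustive search is only practical for a handful of dots.

```python
from itertools import permutations

import numpy as np

def minimal_correspondence(P, Q):
    """Return (pairing, cost) where pairing[i] = j means P_i is paired with Q_j.

    Exhaustively searches one-to-one pairings and minimizes the sum of q_ij x_ij,
    with q_ij taken to be the distance between P_i and Q_j (an assumption).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if len(P) != len(Q):
        raise ValueError("this sketch assumes the same number of dots in both frames")
    # q[i, j] is the cost of pairing P_i with Q_j.
    q = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    best_pairing, best_cost = None, None
    for pairing in permutations(range(len(Q))):
        cost = sum(q[i, j] for i, j in enumerate(pairing))
        if best_cost is None or cost < best_cost:
            best_pairing, best_cost = pairing, cost
    return best_pairing, float(best_cost)

# Hypothetical dot positions in two successive frames.
first_frame = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
second_frame = [(5.2, 0.1), (0.1, 4.9), (0.2, 0.0)]
pairing, cost = minimal_correspondence(first_frame, second_frame)
```

With these positions each dot has moved only slightly, so the cheapest pairing matches every dot to its nearest neighbour in the second frame.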


One limitation of Ullman's approach is that it is restricted
to minimizing a known linear objective function
that is subject to linear constraints. The method can be extended
to constrained nonlinear programming in which the
goal is to minimize a known function $f(x)$ subject to a set
of equality and inequality constraints of the form $g_i(x) \le 0$.
In general, however, criteria based on other than intuition
need to be found for selecting the function $f$ to be minimized.
To do this, one can apply the calculus of variations
(see for example Courant and Hilbert 1953, chapter IV).
The familiar differential calculus shows how to find a real valued
parameter that minimizes some function. The calculus of
variations extends the differential calculus by showing how
one can determine a function $f$, which is subject to a given
set of boundary conditions, and minimizes the integral

$$I(f) = \iint_{\Omega} F(x, y, f, f_x, f_y, f_{xx}, f_{xy}, f_{yy}) \, dx \, dy \qquad (17)$$

over a given region of integration $\Omega$. The function $F$ is
called a "performance index" and generalizes the notion of
cost function associated with linear and nonlinear programming.
In the next section we shall consider the choice of a
performance index for interpolating smooth surfaces from
one-dimensional boundary conditions.

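Associated with a variational problem of the form (17)
is the Euler equation, which provides a necessary, though
by no means sufficient, condition which a function $f$ must
satisfy if it is to minimize the variational integral $I(f)$. For
the particular variational problem given in equation (17),
the Euler equation is

$$F_f - \frac{\partial}{\partial x} F_{f_x} - \frac{\partial}{\partial y} F_{f_y} + \frac{\partial^2}{\partial x^2} F_{f_{xx}} + \frac{\partial^2}{\partial x \partial y} F_{f_{xy}} + \frac{\partial^2}{\partial y^2} F_{f_{yy}} = 0. \qquad (18)$$

In the case that there is only a single independent variable
$x$, the partial derivatives are total and the Euler equation
becomes

$$F_f - \frac{d}{dx} F_{f_x} + \frac{d^2}{dx^2} F_{f_{xx}} = 0. \qquad (19)$$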

There are two important considerations associated with
the use of the calculus of variations. First, unlike the
differential calculus, the existence of an extremum $f$ of
the integral given in equation (17) cannot be taken for
granted. Courant and Hilbert(1953, p. 173) note that "a
characteristic difficulty of the calculus of variations is that
problems which can be meaningfully formulated may not
have solutions". Conditions for the existence of a minimum
have recently been proposed by Grimson(1981) and will be
discussed in the next section.

Second, associated with any variational problem is a set
of natural boundary conditions which imposes a necessary
condition on any feasible solution to the Euler equation at
the boundary. Courant and Hilbert(1953, p. 211) note that
"in general, we can, by adding boundary terms or boundary
integrals essentially modify the natural boundary conditions
without altering the Euler equations". Determining
the "most conservative" solution means finding a performance
index that guarantees the existence of an extremum
function $f$ and provides the tightest set of natural boundary
conditions that are consistent with the given data.

The  calculus  of  variations  has  recently  been  applied 
by  a  number  of  authors  to  interpolate  plane  and  space 
curves  and  surfaces.  We  review  the  applications  in  that 
order.  First,  Horn(1981)  lias  recently  determined  the  curve 
which  passes  through  two  specified  points  with  specified 
orientation  while  minimizing 

J  k2(Is,  (20) 

where  x  is  the  curvature  and  s  is  the  arc  length.  This  is 
the  true  shape  of  a  spline  used  in  "lofting”  (Faux  and  Pratt 

197D,p.  228).  In  a  thin  beam,  curvature  is  proportional  to 
the  bending  moment.  The  total  elastic  energy  stored  in  a 
thin  beam  is  therefore  proportional  to  the  integral  of  the 
square  of  the  curvature.  Since  the  shape  taken  on  by  a  thin 
beam  is  the  one  which  minimizes  the  internal  strain  energy, 
the  curve  that  solves  equation  (20)  is  called  the  ’’curve  of 
least  energy”.  The  variational  problem  is  to  minimize 

[-&-dx. 

J  (i  +  /;)* 

This  has  the  form  of  equation  (17).  Horn(1981,  page  19) 
shows  that  the  Euler  equation  is 


where $\psi$ is the angle between the tangent to the curve
and the axis of symmetry. The solution to this differential
equation is an incomplete elliptic integral of the first kind.
Brady, Grimson, and Langridge (1980) consider a "small
angle" approximation to the curve of least energy, in which
first derivatives can be ignored. The performance index
that they use is $f_{xx}^2$, for reasons that will become evident in
the next section. They find that in that case the solution is
a cubic. Horn (1981, page 2) notes that the fact that a curve
has near minimum energy does not mean that it lies close
to the curve of minimum energy. Note that the existence of
the curve of least energy is guaranteed, as Horn has derived
an analytical formula for it. Approximations to it, such as
Brady, Grimson, and Langridge's, are similarly guaranteed
to exist.
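
The cubic solution of the small-angle problem can be checked with a short derivation. The sketch below applies the standard Euler equation for an integrand that depends only on the second derivative; it assumes $f$ is four times differentiable and that both end positions and end slopes are prescribed. The constants $c_0, \dots, c_3$ are introduced here purely for illustration.

```latex
\begin{align*}
J[f] &= \int f_{xx}^2 \, dx, \qquad F(f_{xx}) = f_{xx}^2, \\
\frac{d^2}{dx^2}\left(\frac{\partial F}{\partial f_{xx}}\right)
  &= \frac{d^2}{dx^2}\left(2 f_{xx}\right) = 2 f_{xxxx} = 0, \\
f(x) &= c_0 + c_1 x + c_2 x^2 + c_3 x^3 .
\end{align*}
```

The four constants are determined by the two end points and the two end orientations, which is why the small-angle interpolant is a single cubic.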

Barrow  and  Tenenbaum(1981)  investigate  the  problem 
of  interpreting  a  line  as  the  image  of  a  space  curve  that 
is  the  occluding  edge  of  an  object.  They  observe  that  the 
problem  has  two  parts:  (i)  determining  the  tangent  vector 
t  at  each  point  on  the  space  curve,  and  (ii)  determining  the 
surface  normal  at  each  point,  given  that  it  is  constrained  to 
be  orthogonal  to  the  tangent.  They  suggest  minimizing  a 
performance index $F$ that is a function of the curvature $\kappa$


† For simplicity of presentation, we restrict attention to functions $f$
of one or two variables $x, y$.


and the torsion $\tau$ (possibly together with their derivatives),
and expresses a suitable notion of "smoothness". They first
consider uniformity of curvature as a measure of smoothness,
that is $F = (d\kappa/ds)^2$, where $s$ measures distance along
the space curve. They reject this measure on the grounds
that it can be made arbitrarily small by "stretching out
the space curve so that it approaches a twisting straight
line". To overcome this difficulty, they propose that the
space curve should also be "as planar as possible or, more
precisely, that the integral of its torsion should be minimized".

Barrow and Tenenbaum finally suggest finding the space
curve that projects to the given image line and minimizes
the performance index $[d(\kappa \mathbf{b})/ds]^2$, where $\mathbf{b}$ is the binormal.

They report that an algorithm based on their analysis
produced the "correct 3-D interpretations for simple and
closed curves, such as an ellipse, which was interpreted as a
circle". However, they note that the rate of convergence was
slow and dependent on the initial data. No consideration
is given to the Euler equations, to the existence of an
extremum given a line drawing $\{x(s), y(s)\}$, or to the natural
boundary conditions associated with the performance index
$[d(\kappa \mathbf{b})/ds]^2$. Empirical evidence that the method works on a
number of simple test cases is encouraging, but there is no
analysis of the scope of the method.

In the same paper, Barrow and Tenenbaum (1981) consider
the interpolation of a smooth surface from depth and
local surface orientation values at all points along the
surface boundary. Their approach is to "seek a technique
that yields exact reconstructions for the special symmetric
cases of spherical and cylindrical surfaces, as well as
intuitively reasonable reconstructions for other smooth
surfaces" (Barrow and Tenenbaum 1981). They observe that
if $\mathbf{n}$ is the surface normal of a cylinder, then the $x$ and $y$
components of the normal, $n_x$ and $n_y$, are linear functions of
$x$ and $y$, so long as the axis of the cylinder lies in the
$xy$-plane. This observation forms the basis of an algorithm to
estimate the surface normal by least squares fitting of the
parameters of the partial derivatives of the normal. As
before, no analysis is given of the Euler equation, the natural
boundary conditions, or the convergence of their algorithm
for different types of surface.

5. A performance index for surface interpolation.

In the review of the application of the calculus of
variations to visual perception in the previous section we
drew attention to three important considerations. First,
the Euler equations provide a necessary condition on
possible extremal functions. Second, the existence of an
extremum cannot be taken for granted, even when the
minimization problem seems plausible on some grounds. Third,
the natural boundary conditions impose a necessary
condition on any feasible solution to the Euler equation at
the boundary. The most thorough analysis of the second
of these problems in Computer Vision, framed in the
context of surface interpolation, is due to Grimson (1981), who
proves the following theorem.

Theorem (Grimson, see Rudin (1973)). Suppose there exists
a complete semi-norm $F$ on a space of functions $\mathcal{F}$, and
that $F$ satisfies the parallelogram law. Then every non-empty
closed convex set $\mathcal{E} \subset \mathcal{F}$ contains a unique element
$f^*$ of minimal norm $F(f^*)$, up to possibly an element of the
null space of $F$.

A semi-norm $F$ is a function $F: V \to \mathbb{R}$ from a vector
space $V$ to the non-negative real numbers that satisfies

$$F(v + w) \le F(v) + F(w)$$

$$F(\alpha v) = |\alpha| F(v).$$

Informally, a semi-norm is a generalization of the Euclidean
metric, and provides a measure of a vector. The first
condition generalizes the triangle inequality, for example. The
null space of the semi-norm $F$ consists of all those vectors
$v_0$ that map to zero. Since

$$F(v + v_0) = F(v),$$

any element of the null space can be added to a vector of
minimal norm to yield another vector of minimal norm.
Hence the qualifying phrase "unique ... up to possibly an
element of the null space of $F$". The parallelogram law
states that

$$[F(v + w)]^2 + [F(v - w)]^2 = 2[F(v)]^2 + 2[F(w)]^2,$$

for all vectors $v, w$. Finally, the semi-norm is complete if all
Cauchy sequences converge. As is well known, the elements
of vector spaces can be functions. This enables Grimson to
prove the following Corollary, which guarantees the existence
of an extremal function in "most conservative" calculus of
variations interpolation problems.

Corollary (Grimson 1981). Let the set of known points
be $\{(x_i, y_i, z_i) \mid 1 \le i \le n\}$. Let $\mathcal{F}$ be a vector space of
possible functions on $\mathbb{R}^2$ and let $\mathcal{E}$ be the subset of $\mathcal{F}$ that
interpolates the known data. That is, for all functions
$f \in \mathcal{E}$, $f(x_i, y_i) = z_i$. Let $F$ be a complete semi-norm on $\mathcal{E}$
that satisfies the parallelogram law. Then there exists a
unique (up to the null space of $F$) function $f^*$ that
interpolates the data and has minimal norm. In particular, if
$F$ is a performance index then there is a function $f$ that
minimizes the integral

$$\iint F(f) \, dx \, dy.$$

In short, if the conditions of the Corollary are fulfilled,
the existence of a "most conservative" surface that meets
the boundary conditions is guaranteed. As we shall see,
the condition of being a semi-norm is the most restrictive
required of the performance index. The conditions are
sufficient to guarantee the existence of a minimum, but they
are not necessary. For example, $\kappa^2$ is not a semi-norm†;
nevertheless Horn's (1981) analysis shows that there is a
unique minimum. It is far from clear whether Barrow and
Tenenbaum's (1981) analyses of curve and surface
interpolation have a guaranteed minimum in all cases.

Grimson notes that several intuitively plausible
performance indices are not semi-norms. For example, the two
most popular measures of curvature are not. Suppose that
$\kappa_1$ and $\kappa_2$ are the principal curvatures of a surface (Faux
and Pratt 1979, p. 111); then the Gaussian curvature $\kappa_g$ is
the product $\kappa_1 \kappa_2$ and the mean curvature $\kappa_m$ is the sum
$\kappa_1 + \kappa_2$. For a surface $f(x, y)$,

$$\kappa_g = \frac{f_{xx} f_{yy} - f_{xy}^2}{(1 + f_x^2 + f_y^2)^2}.$$

Since  the  curvatures  can  be  negative,  while  a  semi- 
norm  is  required  to  be  positive,  it  is  necessary  to  investigate 

$$\kappa_g^2.$$

Grimson (1981) observes that $\kappa_g^2(\alpha f) \neq |\alpha| \kappa_g^2(f)$ because of
the denominator. If $f_x$ and $f_y$ are small, the denominator
is approximately equal to one, and so the expression is
approximately equal to the numerator. Note that it is

$$f_{xx} f_{yy} - f_{xy}^2. \qquad (21)$$

Grimson shows that the mean curvature $\kappa_m$ is also not
a semi-norm for exactly the same reason. The analogous
small angle approximation is

$$(f_{xx} + f_{yy})^2 = (\Delta f)^2,$$

the square Laplacian, which is a semi-norm. We find it
convenient to denote the square Laplacian by $F_L$. Grimson (1981)
chooses the quadratic variation

$$f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2,$$

on the grounds that its null space, consisting of all linear
functions, is smaller than the null space of the square
Laplacian. If we denote the quadratic variation by $F_Q$, we
see that the approximation to the Gaussian curvature given
in equation (21) is

$$\tfrac{1}{2}(F_L - F_Q).$$

How shall we choose a performance index for surface
interpolation, given that it has to satisfy the conditions of
the Corollary? We have exhibited three candidates; are
there more? Notice first that each of the semi-norms given
above is a quadratic form in $f_{xx}$, $f_{xy}$, and $f_{yy}$. It is easy to
show that any quadratic form satisfies the semi-norm and
parallelogram conditions, and so there is an infinite set of
plausible semi-norms to use to find the "most conservative"
interpolated surface. We need an extra condition, and the
one we choose is rotational symmetry, since we suppose that
surface interpolation is an isotropic process. Proposition 6
of section 3 shows that the rotationally symmetric
quadratic forms in $f_{xx}$, $f_{xy}$, and $f_{yy}$ form a vector space that
has the square Laplacian and the quadratic variation as a
basis. The choice of which performance index to use is thus
effectively reduced to the square Laplacian, the quadratic
variation, and linear combinations of them. How shall we
choose between those two? In the light of our earlier
discussion, two criteria suggest themselves: the Euler equations
and the natural boundary conditions.
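
The semi-norm and parallelogram conditions for the two basis forms can be spot-checked numerically. The sketch below is only an illustration under stated assumptions: it works pointwise on a triple `(fxx, fxy, fyy)` of second derivatives rather than on functions, takes the square root of each quadratic form as the norm, and uses hypothetical helper names that do not come from the paper.

```python
import math
import random


def norm_quadratic_variation(h):
    """Pointwise norm sqrt(fxx^2 + 2*fxy^2 + fyy^2) of h = (fxx, fxy, fyy)."""
    fxx, fxy, fyy = h
    return math.sqrt(fxx ** 2 + 2 * fxy ** 2 + fyy ** 2)


def norm_laplacian(h):
    """Pointwise norm |fxx + fyy|, the square root of the square Laplacian."""
    fxx, _, fyy = h
    return abs(fxx + fyy)


def check_semi_norm(norm, trials=1000, seed=0):
    """Test the triangle inequality, homogeneity and the parallelogram law
    on random triples; returns True if no violation is found."""
    rng = random.Random(seed)
    tol = 1e-9
    for _ in range(trials):
        v = tuple(rng.uniform(-5, 5) for _ in range(3))
        w = tuple(rng.uniform(-5, 5) for _ in range(3))
        a = rng.uniform(-5, 5)
        v_plus_w = tuple(p + q for p, q in zip(v, w))
        v_minus_w = tuple(p - q for p, q in zip(v, w))
        a_times_v = tuple(a * p for p in v)
        triangle = norm(v_plus_w) <= norm(v) + norm(w) + tol
        homogeneous = math.isclose(norm(a_times_v), abs(a) * norm(v),
                                   rel_tol=tol, abs_tol=tol)
        parallelogram = math.isclose(
            norm(v_plus_w) ** 2 + norm(v_minus_w) ** 2,
            2 * norm(v) ** 2 + 2 * norm(w) ** 2,
            rel_tol=tol, abs_tol=tol)
        if not (triangle and homogeneous and parallelogram):
            return False
    return True
```

A random test of this kind is evidence rather than proof; it only confirms that no counterexample turns up among the sampled triples.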

Proposition 7. All rotationally symmetric quadratic
forms lead to an identical Euler equation

$$\Delta^2 f = 0.$$

Proof. We exploit the fact that the square Laplacian
and the quadratic variation are a basis of the rotationally
symmetric quadratic forms.

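a. Square Laplacian: The performance index is

$$F_L = (f_{xx} + f_{yy})^2.$$

By equation (18) the Euler equation is

$$\frac{\partial^2}{\partial x^2}\{2(f_{xx} + f_{yy})\} + \frac{\partial^2}{\partial y^2}\{2(f_{xx} + f_{yy})\} = 0,$$

that is

$$\Delta^2 f = 0,$$

as required.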

b. Quadratic variation: The Euler equation is

$$2 f_{xxxx} + 4 f_{xxyy} + 2 f_{yyyy} = 0,$$

that is

$$\Delta^2 f = 0,$$

provided that $f$ is continuous to fourth order.

c. Linear combinations of $F_L$ and $F_Q$: Linear combinations
clearly give rise to the identical Euler equation.

The gist of Proposition 7 is that there is no difference
between $F_Q$ and $F_L$ in the interior, away from the boundary.
We can see the result of Proposition 7 in an alternative
interesting way. Recall that

$$F_L - F_Q = 2(f_{xx} f_{yy} - f_{xy}^2)$$

is the semi-norm approximation to the Gaussian curvature
(equation (21)). The latter expression is an instance of a
divergence expression, and Courant and Hilbert (1953, p.
196) note "If the difference between the integrands of two
variational problems is a divergence expression, then the
Euler equations and therefore the families of extremals are
identical for the two variational problems."
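
One way to see that the Gaussian curvature approximation is a divergence expression is the identity below. It is a standard calculation rather than a formula quoted from the paper, and it assumes that $f$ has continuous third derivatives.

```latex
\begin{align*}
\frac{\partial}{\partial x}\left(f_x f_{yy}\right)
  - \frac{\partial}{\partial y}\left(f_x f_{xy}\right)
  &= f_{xx} f_{yy} + f_x f_{xyy} - f_{xy}^2 - f_x f_{xyy} \\
  &= f_{xx} f_{yy} - f_{xy}^2 .
\end{align*}
```

By the divergence theorem, the integral of this expression over a region depends only on $f$ and its derivatives along the boundary, which is consistent with the two performance indices differing only in their natural boundary conditions.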


† Which is why Brady, Grimson, and Langridge (1980) used the small
angle approximation $f_{xx}^2$.

where $H$ is the Hessian matrix


Since $F_Q$ and $F_L$ have identical Euler equations, we
analyze their natural boundary conditions in order to choose
between them. We could approach this problem directly,
but a more revealing route is available. Courant and
Hilbert (1953, p. 250) consider the statics of a thin plate. In
particular they determine the shape it assumes for a given
force $p(s)$ along its boundary $\Gamma$ and bending moment $m(s)$
normal to its boundary.

Courant and Hilbert note that the energy stored in the
plate is the integral of a quadratic form in the principal
curvatures $\kappa_1$ and $\kappa_2$ of the surface, a result which can be
derived by noting that the elastic energy stored in a thin
strip (corresponding to any normal section) is proportional
to the square curvature. It follows that the stored energy
is locally

$$\begin{aligned}
E &= \alpha(\kappa_1^2 + \kappa_2^2) + 2\beta \kappa_1 \kappa_2 \\
  &= \alpha(\kappa_1 + \kappa_2)^2 + 2(\beta - \alpha)\kappa_1 \kappa_2 \\
  &= \alpha \kappa_m^2 + 2(\beta - \alpha)\kappa_g .
\end{aligned}$$

If we assume that the first derivatives are small, we can use
the same approximation to the curvature used in equation
(21):

$$\begin{aligned}
E &\approx \alpha F_L + 2(\beta - \alpha)\frac{(F_L - F_Q)}{2} \\
  &= \beta F_L + (\alpha - \beta) F_Q \\
  &= \alpha(\mu F_L + (1 - \mu) F_Q),
\end{aligned}$$

where $\mu = \beta/\alpha$. It follows that the energy stored in the thin
plate is a convex linear combination of the square Laplacian
and the quadratic variation, which formally establishes its
connection to the visual perceptual problem studied here.
Observe that setting the weight $\mu = 1$ gives the square
Laplacian, while setting it equal to zero gives the quadratic
variation.
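
The algebra linking the plate energy to the weighted index can be checked with exact arithmetic. The following is a minimal sketch under the small-angle assumption used in the text ($\kappa_1 + \kappa_2 \approx f_{xx} + f_{yy}$ and $\kappa_1 \kappa_2 \approx f_{xx} f_{yy} - f_{xy}^2$); the function names and the sample values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def plate_energy_small_angle(fxx, fxy, fyy, alpha, beta):
    """alpha*(k1 + k2)^2 + 2*(beta - alpha)*k1*k2 with the small-angle
    approximations k1 + k2 ~ fxx + fyy and k1*k2 ~ fxx*fyy - fxy^2."""
    mean = fxx + fyy
    gauss = fxx * fyy - fxy ** 2
    return alpha * mean ** 2 + 2 * (beta - alpha) * gauss


def weighted_index(fxx, fxy, fyy, alpha, beta):
    """alpha * (mu*F_L + (1 - mu)*F_Q) with mu = beta / alpha."""
    mu = Fraction(beta) / Fraction(alpha)
    f_l = (fxx + fyy) ** 2                       # square Laplacian
    f_q = fxx ** 2 + 2 * fxy ** 2 + fyy ** 2     # quadratic variation
    return alpha * (mu * f_l + (1 - mu) * f_q)


# Sample second derivatives and plate constants (illustrative values).
energy = plate_energy_small_angle(3, -1, 2, alpha=4, beta=1)
index = weighted_index(3, -1, 2, alpha=4, beta=1)
```

With `beta == alpha` (so `mu == 1`) the index collapses to `alpha * F_L`, and with `beta == 0` it collapses to `alpha * F_Q`, matching the two limiting cases named above.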

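Energy is also associated with the external force and
moment applied at the boundary. Let the force per unit
length be $p(s)$ along the boundary $\Gamma$ of the plate and a
bending moment $m(s)$ applied normal to the plate. Courant
and Hilbert (1953, p. 251) show that the natural boundary
conditions associated with the plate are

$$p(s) = -\Delta f + (1 - \mu)(f_{xx} x_s^2 + 2 f_{xy} x_s y_s + f_{yy} y_s^2)$$

$$m(s) = \frac{\partial}{\partial n}\Delta f + (1 - \mu)\frac{\partial}{\partial s}\left(f_{xx} x_s x_n + f_{xy}(x_s y_n + x_n y_s) + f_{yy} y_s y_n\right),$$

where $x_s$ and $x_n$ are the partial derivatives of $x$ in the
directions of the tangent and normal to the boundary of
the thin plate. Similarly for $y_s$ and $y_n$. That is,

$$-\Delta f + (1 - \mu)\left([x_s \; y_s] \, H \, [x_s \; y_s]^T\right) = p(s)$$

$$\frac{\partial}{\partial n}\Delta f + (1 - \mu)\frac{\partial}{\partial s}\left([x_s \; y_s] \, H \, [x_n \; y_n]^T\right) = m(s),$$

with

$$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}.$$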

Cdadwcll and Wait (1979) quote a version of this result
due to Agmon (1965), that the biharmonic operator, which
we showed was the natural boundary condition for the
surface interpolation problem, has Dirichlet forms that are
linear combinations of the square Laplacian and the
quadratic variation. As an example of the constraint, consider a
straight line contour aligned with the $x$-axis. Then $[x_s \; y_s] =
[1 \; 0]$ and $[x_n \; y_n] = [0 \; 1]$. The natural boundary conditions
reduce to

$$f_{yy} + \mu f_{xx} = -p(s)$$
$$f_{yyy} + (2 - \mu) f_{yxx} = m(s).$$

The constraint is tightest when $\mu$ is not equal to one.
A similar result can be obtained for a straight line contour
inclined at an angle $\alpha$ to the $x$-axis. The first of the natural
boundary conditions is

$$f_{xx}(\sin^2 \alpha + \mu \cos^2 \alpha) + f_{yy}(\cos^2 \alpha + \mu \sin^2 \alpha) + (1 - \mu) \sin 2\alpha \, f_{xy}$$

If $\mu = 1$, there is no constraint from the cross derivative.
If $\mu$ is not equal to 1, at most one of the terms can be
zero. We conclude that in interpolation problems in which the
small angle approximations used throughout our analysis
hold, it is preferable to choose $\mu$ not equal to one, that is
to say to not use the square Laplacian as a performance
index. The quadratic variation is an obvious choice, but
so are linear combinations of the square Laplacian and
the quadratic variation for which $\mu$ is not equal to one.
Grimson (1981) chooses the quadratic variation since its null
space is smaller than that of the square Laplacian. This is
a precise way of saying that it imposes a tighter constraint.
For example, the function $f(x, y) = xy$ is in the null space
of the square Laplacian but not in the null space of the
quadratic variation. The quadratic variation has the
smallest null space among the linear combinations of the
square Laplacian and quadratic variation (it appears that all
such combinations share this null space and that only the
square Laplacian itself differs), and this is an additional
reason for choosing it. We would further expect that any
differences between the quadratic variation and the square
Laplacian would show up near the given boundary data but
not in the interior, far removed from the boundary. This is what
Grimson (1981) finds in a set of examples that compare
surfaces interpolated using the quadratic variation and the
square Laplacian.
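
The null-space example can be reproduced with a few lines of exact arithmetic. This is a sketch under stated assumptions: second derivatives are estimated by central differences, which are exact for polynomials of degree at most three, and the helper names are hypothetical rather than taken from Grimson's implementation.

```python
from fractions import Fraction


def second_derivatives(f, x, y, h=Fraction(1, 2)):
    """Central-difference estimates of (fxx, fxy, fyy) at (x, y).
    Exact for polynomials of total degree at most three."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy


def square_laplacian(fxx, fxy, fyy):
    return (fxx + fyy) ** 2


def quadratic_variation(fxx, fxy, fyy):
    return fxx ** 2 + 2 * fxy ** 2 + fyy ** 2


point = (Fraction(3, 10), Fraction(7, 10))

# f(x, y) = xy: invisible to the square Laplacian, penalized by the
# quadratic variation.
saddle = second_derivatives(lambda x, y: x * y, *point)

# A linear function lies in the null space of both indices.
plane = second_derivatives(lambda x, y: 2 * x - 3 * y + 1, *point)
```

Here `square_laplacian(*saddle)` is zero while `quadratic_variation(*saddle)` is two, whereas both indices vanish on `plane`, in line with the statement that only linear functions are in the null space of the quadratic variation.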
Abstract

We describe a new approach to low-level vision in which
the task of image segmentation is to distinguish meaningful
relationships between image elements from a background
distribution of random alignments. Unlike most previous
approaches, which start from idealized models of what we
wish to detect in the world, this approach is not based on
prior world knowledge and uses measurements which can
be computed directly from the input signal. Groupings of
image elements are formed over a wide range of sizes and
classes while attempting to make use of all available
statistical information at each level of the grouping hierarchy,
resulting in far more sensitive discrimination than is
possible from just local measurements. This paper explores the
range of grouping capabilities and discriminations exhibited
by the human visual system and discusses the application
of the meaningfulness measure to each of them.

Introduction 

The human visual system has the capability of
spontaneously detecting many very general classes of patterns
in an image, even when there is no high-level or semantic
knowledge available to guide the interpretation. Figure
1 gives some examples of the range of this capability,
including the detection of collinearity, predominant
orientation, bilateral and rotational symmetry, and repetition in
an otherwise random field of dots. Computer vision systems
currently lack almost all of these early vision capabilities,
with the exception of edge detection. Even the edge
detection capabilities of current computer programs are far below
the level of human performance.

The importance of early vision and spontaneous image
organisation has long been recognized, and has gone
under names such as image segmentation, figure/ground
phenomena, perceptual grouping, and gestalt organization,
all of which emphasize the selection of subsets of image
elements which somehow naturally belong together. A
grouping is successful to the extent that it brings together
elements in the image that have arisen from the same process
or belong to the same object in the three-dimensional world
being viewed. These groupings can then greatly reduce the
combinatorics of the search space when forming higher levels
of grouping, matching against world knowledge, making use
of texture properties, or searching for correspondences as in
stereo vision.

Previous methods for image segmentation have usually
been derived from an idealized model of the world (such as
the step edge model often used in edge detection or regular
texture models used in texture description). The approach
taken here is very different and is largely free of prior
expectations about the structure of the world. The central
concept of this paper is that it is possible to calculate,
in a domain-independent way, a statistical measure of the
likelihood that some grouping truly reflects an
interdependence of its subparts in the world (i.e., that the grouping
is not the result of a random alignment of independent
elements). This measure can be applied to groupings at all
resolutions and positions in the image to determine which
ones are the most meaningful and are most likely to lead
to further correct interpretations. There is no need to
assume a certain level of "noise" in the image, since
groupings at all resolutions (allowing all ranges of variations) can
be examined, and those which are most meaningful (carry
the strongest statistical implications) can be selected as the
most useful description of the structure. To give a practical
example, it may be important for further interpretation to
recognize that the edge of a tree trunk is essentially straight,
even though our eye is able to resolve many small
perturbations along its length. In this case it would be important
to generate at least two different meaningful descriptions
for the same curve in the image, corresponding to different
resolutions of grouping.

The importance of making precise measurements of the
meaningfulness of each grouping is most apparent when
attempting to derive strong global information from locally
weak information. For example, an edge in a digitized
image may be indistinguishable over each small
neighborhood along the edge from background sensor noise, as
many researchers attempting to derive local edge detectors
have discovered. However, if we make many local
measurements of meaningfulness at different orientations, and then
measure the statistical likelihood that different groupings
of local measures would happen to align themselves into
a longer smooth edge, we are able to derive much higher
measures of meaningfulness for the overall edge than for
the local measurements. This implies that a long edge is
more detectable than a short edge and that a straight edge

Figure 1: Some examples of the human ability to detect various classes of patterns in an otherwise random field of
dots: (a) the basic pattern of 110 dots positioned at random; (b) a string of dots is added which has nearest neighbor
distances approximately the same as the background average; (c) the field in (a) is shifted diagonally and overlaid on
itself, resulting in a statistically predominant peak in orientation; (d) the field of dots is reflected about a vertical axis,
producing bilateral symmetry; (e) the upper right quadrant is rotated into the other quadrants to produce four-fold
circular symmetry; (f) the left third of the image is shifted and reproduced twice to form a repetitive pattern.


is more detectable than one with many sharp twists and
turns, something apparently also true of human vision. A
similar need for global combination of weak local measures
applies to all of the examples in Figure 1.

Although the meaningfulness measures applied to each
grouping are domain-independent and based on purely
geometrical measures of non-randomness, there is still
considerable leeway in choosing the groupings to subject to
these tests. To test all possible combinations of image
elements would be an exponentially expensive process. The
human visual system certainly does not recognize all
meaningful groupings: as is shown in Figure 2, a pattern such as
five equally-spaced dots aligned in a row (which is very
unlikely to have arisen at random) will not be spontaneously
detected if it is surrounded by enough similar elements. Our
approach to defining the sets of groupings to be considered
is to use several general principles (such as scale invariance
and linear time computational complexity) and to otherwise
rely on data regarding human performance.


The  Measurement  of  Meaningfulness 

By meaningfulness we mean a measure of how likely
some grouping is to have arisen from an underlying physical
relationship between the constituent features rather than
through some accident of viewpoint or location. It is
important to have quantitative measures of meaningfulness so that
more global combinations of local measures will have precise
information to work with in evaluating the significance of
each combination. Some commonly used operations, such as
thresholding and linear convolutions, destroy a great deal of
this information.

We calculate the probability that each selected
grouping in the image could have arisen from a random
perturbation of the surrounding distribution of similar features.
If this is unlikely, then as is commonly done in
inferential statistics, we infer that nonchance factors are probably
responsible for the grouping. It is common in inferential
statistics to specify a threshold at which something is
considered to be meaningful (e.g., a 0.05 significance level, or
alpha level, corresponding to one chance in 20 that such an
unusual result would have arisen from random data). We
use a threshold such as this when deciding which results
to display during output, but otherwise there is no need
for any type of thresholding. The significance values can
be combined into higher levels with their own significance
based on the fact that two events which are unlikely to have
occurred at random are even more unlikely to have occurred
together.

The most common way to measure the likelihood of
some value is to look at the shape of the distribution in
which it is embedded and measure what proportion of the
distribution has values at least as extreme as the one
under consideration (for example, we may assume a Gaussian
distribution and compare the value to the standard
deviation). However, in the types of vision problems we are
considering there is no particular knowledge of the shape of
the distribution and we do not have enough data to reliably
derive the shape from the data for each region. Therefore,
we have chosen to use a distribution-free way of calculating
likelihoods based on the methods of nonparametric
statistics. A nonparametric test of significance makes no
assumptions concerning the shape of the parent distribution or
population. Its measure of likelihood is based on a rank
test which measures what proportion of all combinations of
rankings have values ranked higher in the surrounding
distribution than the values under consideration.
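
A rank test of this general kind can be sketched as follows. The paper does not say which rank statistic it uses, so the example below substitutes the standard exact one-sided rank-sum test as an assumption: it pools the group with its surround, ranks the pooled values, and counts the fraction of equally sized subsets whose rank sum is at least as large as the group's. The function name and the sample values are hypothetical.

```python
from itertools import combinations


def rank_sum_significance(group, surround):
    """Exact one-sided rank-sum significance: the proportion of all
    size-len(group) subsets of the pooled sample whose rank sum is at
    least as large as that of `group`. Smaller means less likely to be
    a chance arrangement."""
    pooled = sorted(list(group) + list(surround))
    # Rank 1 is the smallest value; tied values share the average rank.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    all_ranks = [rank_of[v] for v in pooled]
    observed = sum(rank_of[v] for v in group)
    total = 0
    at_least_as_extreme = 0
    for subset in combinations(all_ranks, len(group)):
        total += 1
        if sum(subset) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Three measurements that outrank all seven surrounding measurements.
p = rank_sum_significance([9, 10, 11], [1, 2, 3, 4, 5, 6, 7])
```

In this example only one of the 120 possible three-element subsets ranks as high as the group, so `p` is 1/120. Enumerating all subsets is exponential in the sample size, so a sketch like this is only practical for the small neighborhoods the text has in mind.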

We use discrete statistics above because that is the
nature of the vision problem. The image data is not a
continuous function but a set of discrete values, and further
groupings consist of discrete sets of these initial values. That
is one reason why we emphasise the grouping of isolated
points in our examples rather than working with gray-scale
images that superficially appear continuous: the discrete
versions are probably more accurate reflections of what the
individual stages of human vision have to work with.

Which  Groupings  Should  be  Tested? 

As mentioned previously, it would be combinatorially
expensive to examine all possible groupings in an image.
It would be prohibitive to even form all pairs of features,
let alone the larger sets. The human visual system clearly
detects only certain classes of meaningful patterns, and
experimental work has been carried out to explore this range
of performance. Within the computer vision community,
Marr [8, 9] in conjunction with Riley [11] and Stevens
[13] has discussed the importance of grouping operations
and has demonstrated many informal psychophysical
experiments testing human performance. Marr emphasized
the need for grouping on the basis of length, orientation,
size, contrast, and spatial density. However, very little of
this work was fully specified or implemented in computer
programs. Within the psychology community there have
been many investigations of human performance on specific
grouping problems. Julesz [6] has carried out many
experiments on human performance in distinguishing different
textures. Glass [4, 5] has examined the perception of Moiré

Figure 2: Some patterns, such as the five equally spaced dots in
(a), are not spontaneously detected by human vision if they are
surrounded by enough similar elements, as in (b), even though
the five dots remain highly meaningful in the statistical sense.

patterns of the sort shown in Figure 1c. Many researchers
have been intrigued by the human ability to detect bilateral
symmetry in otherwise random images (as in Figure 1d),
and quantitative experiments on human performance in the
face of perturbations in the symmetry have been carried out
by Barlow and Reeves [1] and Bruce and Morgan [3]. With
the exception of some of Marr's work, all of this research
has focussed on performance rather than mechanism.

In addition to data on human performance there are
several other constraints on the classes of groupings which
should be formed. One is the principle of scale invariance,
which means that the same groupings should be formed over
a wide range of different sizes in the image. This is just
another case of the principle of viewpoint invariance, which
also implies the obvious position and rotation invariance.
The practical implication of scale invariance is that the same
grouping operators must be applied at a range of different
sizes in the image (usually chosen to increase by powers of
two). Another constraint is that the classes of groupings
attempted should be of linear time complexity in terms of
the number of items being grouped. Since we are dealing
with such large numbers of elements, any attempt at higher
order complexity would seem to be too computationally
intensive. Therefore, we can only examine groupings between
each feature and a fixed number of neighboring features,
relying on lower resolution operators to connect features
which are more physically distant in the image. When
features cannot be grouped at a lower resolution and are too
distant to be grouped at a high resolution (as in Figure 2),
then the grouping will not be detected.

Only the very lowest levels of grouping operate directly
on the image intensity data. Other levels combine the results
of previous groupings, looking for meaningful groupings
of meaningful values, and in this way build up a layered
description of the image. Since there must be some overlap
between selected groupings along each dimension in order
to minimize the effects of discretization, it is possible to
interpolate between neighboring values to precisely locate the
best description for each feature.


Linear groupings. One level of image segmentation that
has received a great deal of attention is the detection of
“edges,” usually defined as the detection of extended
intensity discontinuities in the original scene. In keeping with our
philosophy of looking for statistically meaningful groupings
in the image rather than for the image of some idealised
feature in the world, we prefer to think of edge detection
as the detection of meaningful linear or curvilinear groups
of points, where the values of the points have already been
detected by some earlier stage. In this case the earlier
stage should probably be an isotropic (circularly symmetric)
operator which calculates the ratio of center to surround
intensity (the strongest evidence for this stage of processing
comes from neurophysiological experiments measuring
the output of center-surround neurons in the retina). This
isotropic operator would be applied at a range of resolutions
over the entire image, and at each resolution we would look
at all orientations and positions for linear sets of isotropic
values that were significant with respect to the surrounding
isotropic values.

When combining meaningfulness values from independent
regions of the image (such as when combining the
values of isotropic operators that lie in a linear arrangement),
the independent likelihood values are multiplied
together to produce the meaningfulness measure for the new
combination. For example, if there is only one chance in 10
that some feature would arise at random from its surrounding
distribution, and there is a likelihood of 5 for some other
independent feature aligned with it, then there is only one
chance in 50 that a specific combination with such unusual
values would have arisen from the distribution. However, if
we attempt to make many groupings of some feature with
its neighbors, then we must divide the likelihood of the
combination by the number of attempted groupings of each
feature to compensate for the increased number of groupings
being tested. These considerations all derive from basic
probability theory when calculating the likelihood that some
grouping would have arisen randomly from a background
distribution.
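
The arithmetic of this example can be sketched in a few lines of
Python. This is only an illustration of the rule stated above, not
the program described later in the paper; the function name
`combined_meaningfulness` and the figure of eight attempted
groupings are assumptions introduced for the example.

```python
from math import prod


def combined_meaningfulness(likelihoods, attempts=1):
    """Multiply independent likelihood values, then divide by the
    number of groupings attempted for each feature."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return prod(likelihoods) / attempts


# One chance in 10 combined with one chance in 5: one chance in 50.
print(combined_meaningfulness([10, 5]))              # 50.0

# Hypothetical case: the same pair found after trying 8 groupings.
print(combined_meaningfulness([10, 5], attempts=8))  # 6.25
```

The division acts as a correction for multiple testing: the more
groupings are tried from a feature, the less surprising any one
coincidence becomes.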

In  addition  to  looking  for  linear  groupings  at  a  full 
range  of  sizes  (resolutions)  in  the  image,  we  also  need  to  look 
at  a  range  of  elongations  (ratios  of  length  to  width).  We  do 
not  know  in  advance  what  the  length  of  a  linear  grouping 
will  be,  and  attempting  to  form  groupings  at  all  elongations 
allows  us  to  combine  the  statistical  information  over  the 
entire  length  of  an  edge  before  deciding  upon  its  significance. 
As  we  increase  the  elongation  it  is  necessary  to  increase  the 
number  of  orientations  being  examined  by  the  same  ratio 
in  order  to  cover  the  full  space  of  possibilities.  However, 
longer  elongations  need  to  be  sampled  less  frequently  in 
the  direction  of  elongation,  so  the  overall  computational 
requirements  remain  constant  as  elongation  is  increased. 
This  example  of  detecting  linear  features  is  worked  out  in 
full  in  the  following  section  of  this  paper,  and  a  computer 
implementation  of  the  algorithm  is  described. 
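
The claim that the cost stays constant can be checked with a small
bookkeeping sketch. It assumes, for illustration only, eight
orientations at the shortest elongation and a sampling density along
the line that falls in proportion to the elongation; the function
name and its parameters are hypothetical.

```python
def operators_per_position(elongation, base_orientations=8, base_samples=1.0):
    """Number of linear operators tested per image position.

    `elongation` is the operator length relative to the shortest one.
    """
    orientations = base_orientations * elongation  # more orientations to cover
    samples_along = base_samples / elongation      # fewer positions along the line
    return orientations * samples_along


for e in (1, 2, 4, 8):
    print(e, operators_per_position(e))  # the product is 8.0 at every elongation
```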

Figure 3: There may be more than one peak in meaningfulness
at different resolutions for the same data. The dots shown in (a)
could be described as a series of straight line segments at one
resolution, as in (b), but also be represented as a single linear
grouping at a lower resolution, as in (c).

Some important features of this method for detecting
meaningful linear structures are that it does not require
edges to be continuous (it works well with dotted lines or
lines with gaps) and it makes no prior assumptions about
the amount of “noise” (variation) in the linear structure.
Since it tests for statistical meaningfulness over all possible
lengths of an edge, using a much more sensitive test than the
typical linear mask, it is possible to accommodate any amount
of noise in the linear description so long as the length of the
edge is sufficient to make the overall linearity statistically
meaningful. After testing all resolutions and elongations, it
is possible to select the resolution and elongation with the
highest meaningfulness as the most useful description of the
grouping. Many groupings may have more than one peak
in meaningfulness at different resolutions or elongations, as
shown in Figure 3.
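
Selecting the most useful descriptions amounts to finding peaks in
a table of meaningfulness values. The sketch below handles a single
dimension (resolution) only; the function name and the sample values
are hypothetical, and a fuller version would search over resolution
and elongation together.

```python
def meaningfulness_peaks(values):
    """Return the indices of local maxima in `values`, which are
    assumed to be ordered from finest to coarsest resolution."""
    peaks = []
    for i, v in enumerate(values):
        left_ok = i == 0 or v > values[i - 1]
        right_ok = i == len(values) - 1 or v > values[i + 1]
        if left_ok and right_ok:
            peaks.append(i)
    return peaks


# Hypothetical meaningfulness of one grouping at six resolutions.
by_resolution = [2.0, 9.0, 3.0, 4.0, 12.0, 5.0]
peaks = meaningfulness_peaks(by_resolution)
best = max(peaks, key=lambda i: by_resolution[i])
print(peaks, best)  # [1, 4] 4
```

Keeping every peak rather than only the best one preserves the
multiple descriptions of the kind shown in Figure 3.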

Curvature and corners. We have dealt so far with
only straight edges. One possible extension would be to
apply the same grouping techniques to regions of the image
corresponding to all possible arcs of constant curvature.
However, the increase in computation this requires is rather
large, and human performance does not seem to make full
use of the statistical information over the length of a curve
of constant curvature (however, careful experiments have
not yet been carried out to test this point). It seems more
likely that human vision only detects smoothness between
locally linear groupings. In other words, a smooth curve
contains a sequence of meaningful linear segments (the length
of each segment depending on the resolution being used),
and neighboring linear segments at slightly different
orientations are grouped into a description of curvature at that
point. The allowable range of orientations for smooth
curvature depends on the elongation of the linear segments,
with longer elongations allowing less change in orientation
between segments.

Figure 4: The decision as to whether there is a corner (tangent
discontinuity) along a curve can change at different resolutions
depending on the length of the support for the curves on each side
of the potential corner. In (a) there is a corner when grouping
at a low resolution but a smooth curve at a higher resolution. In
(b) the roles are reversed.

The dual of a smooth-curve continuation from a segment
is the detection of a termination. In other words, if
there is no smooth continuation from a meaningful linear
grouping, then this implies the detection of a meaningful
termination of the curve. A corner occurs where two
terminations coincide. The same set of points may have a
termination at one resolution and may be grouped as a smooth
curve at another, as shown in Figure 4.

Marr [8], Schatz [12] and others have shown the
importance of detecting terminations and grouping them in
further stages. Many of their examples of human texture
discrimination are best explained by assuming that “virtual
lines” are formed between curve terminations in the same
way that we have formed linear groupings of other points in
the image. Figure 5a shows an example of an edge formed
by a linear sequence of terminations, and Schatz gives many
other examples where grouping of terminations is necessary
for texture discrimination. Each termination can be treated
as a point and fed back into the colinearity detection stage.
This is one place where the multiple descriptions of Figure
4 become important, since human vision can detect
alignments of low-resolution corners or terminations, even when
a curve is smooth at a high resolution.

Orientation and size. Almost all work on edge detection
within the computer vision community has dealt with
the detection of edges between regions of different intensity.
However, as Figure 5 demonstrates, there are many cases in
which human vision detects edges between regions with the
same average intensity but with properties differing in other
ways. One of the strongest effects is produced by changes
in orientation, as shown in Figure 5b. The discrimination
of the more subtle differences in Figure 5c can be explained
similarly by assuming that virtual lines are constructed
between the line terminations and the differing orientations
of these virtual lines are discriminated. Another significant
dimension of variation is size, including both length and
width of elongated elements. In Figure 5d each dot in
one region is smaller than those in the other, although their
number has been increased to produce equal image intensity in
both regions. Figure 5e has the same number and size of
dots in both regions, but with different spatial distributions.
Discrimination of this example can be explained by the same
mechanism as for Figure 5d, since at some lower resolution
the dots in the central region will clump into clusters
with greater “size” but lower density than the other region.
Figure 5f shows the effect of differing lengths, which is much
less pronounced.

Size and orientation parameters need to be calculated
for regions at all resolutions in the image just as average
intensity was calculated, and these results can be fed into
the edge grouping process in the same way as the intensity
information was. However, there are a number of important
issues to be resolved in this process. First of all,
how detailed are the characterizations of the distributions
of element parameters in each region? It seems that they
are not very detailed at all, as is shown in an example by
Riley (reproduced in Marr [8]) in which people are unable to
distinguish a region with equal numbers of edges at two
opposite orientations from a region with randomly distributed
orientations. This is another example of a highly meaningful
grouping which human vision fails to detect. It appears
that a single parameter specifying predominant orientation
is all that is required to explain human performance. We
intend to carry out a series of similar experiments to precisely
determine the characterization that human vision gives to
orientation and size statistics. A second important issue is
how to determine meaningfulness of these orientation or size
measures. For example, if a region only contains one element
there is no way to determine whether a particular orientation
or size value is random or meaningful. In general, as
long as a region contains at least two elements it is possible
to calculate some meaningfulness value, and the more elements
it contains the higher is the potential meaningfulness.

Symmetry and repetition. Assuming that the above calculations
are carried out in a fairly complete way, we conjecture
that the detection of symmetry and repetition will require
no further mechanisms. Note that the linear grouping
of objects as described above will group even a single nearby
pair of objects into a linear grouping, although the meaningfulness
assigned to a single pair will be low. However, if we
look for peaks in the orientation distribution of each region
in the image, then a number of low-meaningfulness linear
groupings can have a high meaningfulness if they form a
large peak in orientation. Repetition of a pattern (including
the case where the repeated pattern is reflected about some
axis as in bilateral or other symmetries) will result in many
parallel matches between similar features at various resolutions.
These parallel matches will not be significant unless
the components are located close to one another at their
particular resolution; however, as many psychological experiments
have shown (see Julesz [6]), human performance
deteriorates very rapidly as the distance to be spanned in
making these correspondences increases. Given this conjecture,
we intend to carry out other experiments to test its
implications for human performance.

Figure 5: All of these examples have the same average intensity over the differing regions, yet human vision is capable of
detecting edges based on changes in other properties. In (a) there is a linear alignment of edge terminations horizontally
through the center of the image. Example (b) demonstrates the detection of a change in orientation, and (c) demonstrates
a change in orientation of virtual lines. In (d) there is a change in element size, in (e) there is a change in clustering
statistics, and in (f) there is a change in line length.

The case of symmetry points up the role of high-level
knowledge in detecting meaningful groupings. Symmetry
is easier to detect when the orientation of the symmetry
is parallel to some strong local reference (e.g., the edges of
the figure or the page). For operations of the type we have
described, the only role that high-level knowledge can play is
to bias the meaningfulness results calculated by lower level
operations. This can lead to improved performance in such
tasks as symmetry detection where the low level results are
very weak to begin with. However, this is still very different
from the heterarchical approach often adopted in artificial
intelligence, which is to calculate only the easiest results
bottom-up and to use this knowledge in a top-down fashion
to guide computation of the rest. The reason this heterarchical
approach fails in many cases for low-level vision is
that all the low-level results may be weak so that none of
them can be used to guide computation of the rest. The only
solution in this case is the computationally intensive one we
have adopted: form all potential groupings bottom-up and
test each one for meaningfulness of the entire combination.

Three-dimensional groupings. It has long been recognised
that an important function of early vision is the
derivation of the three-dimensional structure of the scene.
In previous papers by Binford [2] and Lowe and Binford
[7], we have described how various classes of meaningful
alignments in a monocular image carry implications for the
three-dimensional structure of the scene. For example, if
elements of the image are colinear then they are also colinear
in three-space, barring an accident in viewpoint. If two
edges terminate at a point in the image, or three or more
edges converge to a common point, then they must also
terminate at a common point in three-space unless the viewpoint
is restrictively aligned to produce the coincidence. If
one curve terminates at another continuous curve, the
terminating curve cannot be closer to the viewer than the
continuous curve, or the termination would be unlikely to occur
at that location. Curves which are parallel in the image are
probably parallel in three-space. There are other similar
inferences for interpreting cast shadows or the boundaries
of a region. In all these cases, the inferences are based on
measures of meaningfulness, where a meaningful grouping
in the image leads to statistical inferences for the
three-dimensional structure. By combining the meaningfulness of
image groupings as described above with the assumptions
of general camera position and general light-source position,
it is possible to precisely quantify the strength of each
inference.

One much studied mechanism for deriving three-dimensional
information is stereopsis, which depends on a
general-purpose mechanism for matching between two images.
The groupings we have described can serve as preliminary
descriptions for forming matches between images, and the
resulting matches can be treated as new groupings with
meaningfulness measures of their own. For example, in
performing stereo interpretation of random-dot stereograms,
at each resolution about one element in 10 should form a
grouping with 0.1 significance, one element in 100 should
form a grouping with 0.01 significance, etc., and if two
images contain significant features from the same class
within small corresponding fusional areas, it would be
possible to calculate the meaningfulness of this correspondence.
Mayhew and Frisby [10] describe a stereo interpretation system
that detects edges by grouping isotropic point measures
in linear three-space groupings, similar to the techniques
outlined here but without an explicit measure of meaningfulness.
Stereo processing presents yet another opportunity
to derive highly meaningful larger groupings from weak local
groupings, in addition to its role in providing explicit
depth information.

Implementation  of  the 
Linear  Meaningfulness  Algorithm 

In order to illustrate these methods in more detail,
we have written a program for detecting meaningful linear
groupings of points in an image at all resolutions,
orientations, and elongations.

The program takes as input a set of dots like those
in Figure 9a (although they need not all be of the same
size). The first stage of the program accumulates the density
of points falling into square regions at all resolutions from
1/256 the width of the image to 1/8 the width of the image,
each resolution twice as coarse as the previous one (a total of
6 resolutions). Each region overlaps with its neighbors by
50% in the vertical and horizontal directions, so that each
point falls into four of these regions at each resolution. We
then compute the center-surround values at each resolution,
which is done by subtracting the average intensity of the
surrounding 8 square regions from the intensity of each central
region. This first stage of processing in our implementation
is very crude; there should at least be smooth transitions
between neighboring regions rather than abrupt boundaries.
However, the novel part of the algorithm is not in this stage
but in the way these initial isotropic values are combined to
produce meaningful linear groupings.
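
A minimal sketch of this first stage is given below; it is not the
original program. It assumes that dot coordinates are scaled to the
unit square, that the eight surrounding regions are those one full
region width away (so that they do not overlap the center), and
that regions near the border simply use whichever neighbors exist.
All function and variable names are hypothetical.

```python
def accumulate_density(points, width):
    """Count the points falling in square regions of side `width`.

    Regions are placed every width/2 along both axes, so neighbors
    overlap by 50% and an interior point is counted in four regions.
    Coordinates are assumed to lie in the unit square [0, 1).
    """
    half = width / 2.0
    n = int(round(1.0 / half)) - 1   # regions per axis that fit in the image
    counts = {}
    for x, y in points:
        ix, iy = int(x // half), int(y // half)
        for i in (ix - 1, ix):
            for j in (iy - 1, iy):
                if 0 <= i < n and 0 <= j < n:
                    counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts, n


def center_surround(counts, n, i, j):
    """Density of region (i, j) minus the mean of its eight neighbors.

    Assumption: the surround lies two index steps (one region width)
    away in each direction, so it does not overlap the center.
    """
    around = [(i + di, j + dj)
              for di in (-2, 0, 2) for dj in (-2, 0, 2)
              if (di, dj) != (0, 0)]
    inside = [c for c in around if 0 <= c[0] < n and 0 <= c[1] < n]
    center = counts.get((i, j), 0)
    if not inside:
        return float(center)
    return center - sum(counts.get(c, 0) for c in inside) / len(inside)


# Six region widths, each twice as coarse as the last: 1/256 ... 1/8.
WIDTHS = [2 ** k / 256 for k in range(6)]

counts, n = accumulate_density([(0.30, 0.30)], WIDTHS[-1])
print(len(counts), center_surround(counts, n, 4, 4))  # 4 1.0
```

Because the counts are held in a dictionary keyed by region, regions
that contain no dots are never stored or visited.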


Figure 6: These two sets of points have the same standard
deviations from best-fit lines of the same length, yet (b) is much
more meaningful as a linear feature than (a).

One  commonly-used  method  for  measuring  the  degree 
of  linearity  among  data  points  is  to  measure  the  standard 
deviation  of  the  data  points  from  the  line  with  the  best 
least-squares  fit,  with  a  lower  standard  deviation  indicating 
a  better  fit.  However,  as  Figure  6  demonstrates,  two  groups 
of  points  with  the  same  standard  deviation  to  a  line  of  the 
same  length  can  exhibit  very  different  degrees  of  meaningful 
linearity.  That  is  the  importance  of  the  two-stage  process  we 
have  used  here,  where  isotropic  meaningfulness  values  are
calculated  first  and  their  effects  separated  from  the  linear 
meaningfulness  values. 

As discussed in the previous section, we examine sets
of isotropic values in linear arrangements in the image and
calculate the probability that each set of such meaningful
values would arise at random. This is done by first comparing
each isotropic value to a surrounding set of similar
values, and calculating a likelihood for it based upon its
rank in the surround. Then, given that the isotropic values
are computed over independent regions of the image, we
multiply their likelihood values together to compute the
likelihood that values with a meaningfulness at least that
high would happen to occur together. Finally, we divide this
value by the number of groupings of that class attempted
from each point, since the more groupings which are attempted
the more coincidences we expect to find in a random
distribution.

Figure 7: Meaningfulness values for pairs of isotropic values are
computed at eight orientations, making use of the independent
overlaps in two directions.

Figure 8: The meaningfulness of each pair of isotropic values
is computed with respect to eight surrounding values as shown
in (a). In (b), a pair is combined with two colinear pairs of the
same orientation but slightly different positions.

We start by computing linear meaningfulness of just
pairs of isotropic values and then combine these in stages
into longer elongations. We compute linear meaningfulness
for pairs at eight orientations as shown in Figure 7,
following our rule that adjacent orientations should overlap
by 50% to minimise the effects of discretization. We compare
each value in these pairs to eight surrounding
values in sidebars alongside the pair as shown in Figure 8.
Meaningfulness is calculated by taking the total number of
surrounding values plus one and dividing by the number
which are ranked higher than the center value plus one.
Therefore, if one of the center values is higher than all 8 of
the surrounding values, it is assigned a meaningfulness of
(8 + 1)/(0 + 1) = 9, whereas if it is only higher than seven
of them it is assigned a value of (8 + 1)/(1 + 1) = 4.5. Note
that one of these linear operators can be assigned a certain
degree of meaningfulness just from a single meaningful
isotropic value; this is necessary so that combinations into
longer elongations will have some measure by which to compute
the meaningfulness of the combination even if there is
incomplete evidence for linearity in the shorter operator.
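
The rank computation can be written out directly. The following is
a sketch of the formula as stated, with a hypothetical function name;
the treatment of ties (not counted as higher) is an assumption, since
the text does not say how they are handled.

```python
def rank_meaningfulness(center, surround):
    """(number of surrounding values + 1) / (number ranked higher + 1)."""
    higher = sum(1 for v in surround if v > center)  # ties are not counted
    return (len(surround) + 1) / (higher + 1)


surround = [1, 2, 3, 4, 5, 6, 7, 8]
print(rank_meaningfulness(9, surround))    # 9.0: higher than all eight
print(rank_meaningfulness(7.5, surround))  # 4.5: higher than seven of them
```

If the center and its eight neighbors are drawn from the same
distribution, with no ties, the center is equally likely to take any
of the nine ranks, so a value of 9/(k + 1) or more arises with
probability (k + 1)/9. The meaningfulness is therefore the reciprocal
of the chance of seeing a rank at least that extreme at random.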

Each  of  these  linear  operators  is  combined  with  its 
neighboring  colinear  operators  to  produce  operators  with 
longer  elongations.  Each  stage  of  combination  produces 
elongations  which  are  twice  as  long  and  therefore  need  to 
be  tested  at  twice  as  many  orientations  to  cover  the  space 
of  possibilities  with  the  same  degree  of  overlap.  Therefore, 
each  linear  operator  is  combined  with  each  of  two  operators 
with  the  same  orientation  but  slightly  different  positions 
perpendicular  to  the  direction  of  linearity — to  approximate 
linear  operators  of  different  orientations — as  is  shown  in 
Figure  8b. 

The results of carrying out these computations on an
image of dots are shown in Figure 10. Figure 9a is the original
input to the program, which contains a linear feature
immediately apparent to human vision but which nonetheless
is based on weak local evidence. Figures 10a through
10c show the results of computing linear meaningfulness at
three different resolutions, each one twice as coarse as the
previous one. Each line shows position and length of a
linear operator and the circles at the end of each line are
proportional to the log of the likelihood computed for that
operator. The large amount of output in these figures
includes many linear groupings which are only of marginal
meaningfulness, although data of this sort would be necessary
in many situations for grouping on the basis of orientation
or other higher level grouping. However, when we
reduce the meaningfulness threshold for display to the 0.01
significance level, we get the results shown in Figures 10d
through 10f, which give only linear groupings of elements
corresponding to the single prominent diagonal line.

Figure 10: Figures (a) through (c) show the results of computing meaningful linearities at three different resolutions
(of increasing powers of 2) on the data of Figure 9. The lines represent linear groupings at various elongations, and the
circles at the end of each line are proportional to the log of the likelihood value. If we reduce the significance threshold
to the 0.01 level, we are left with the results displayed in (d) through (f).

This initial demonstration is quite crude in many
respects, and there are many issues remaining to be resolved.
Nevertheless, there are some ways in which this example has
impressive performance. As shown in Figure 9b, if we examine
small regions of the input data there is insufficient
information to reliably detect a linear arrangement of dots.
Therefore, the program has in some sense combined the
statistical information from along the full length of the
feature before arriving at its conclusion of meaningfulness.
This is a simple example of the detection of globally
significant features from locally weak information that was
discussed earlier. Also, unlike most other edge detection
programs, it has made no assumptions about the amount
of “noise” (deviation of the dots from a straight line), and
would have detected the grouping over a very wide range
of sizes in the image. Unlike linear convolutions, it is
relatively insensitive to a few large dots placed near the linear
grouping in the image, since it uses a non-parametric rank
test rather than making assumptions about the surrounding
distribution. This approach is also very different from the
simple detection of zero crossings, since zero crossings are a
local measure that may not be supported by any meaningful
information (for example, zero crossings may wander
randomly under the influence of sensor noise in regions of the
image which do not have some sufficiently strong changes in
intensity).

Computation  Time  Considerations 

The methods discussed in this paper (testing groupings
for meaningfulness at all possible positions, resolutions,
and parameter values) are more computationally expensive
than those used in most current computer vision programs.
However, the difference is only a moderate linear increase
over other methods and is not as large as might be feared
at first. For example, examining the image at 6 resolutions
can be less than a factor of 2 more expensive than examining
it at the finest resolution, since halving the resolution
means that only one quarter as many groupings need to be
examined over the area of the image. When using a digital
computer there are techniques that allow us to not even
consider groupings that do not have any potentially meaningful
constituents. The program described in the previous
section used hash coding to access each grouping (each dot
was hashed into all groupings which contained it), so that
groupings for positions in the image which did not contain
any dots were not even considered.
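
The factor quoted above follows from a geometric series: if each
coarser resolution needs one quarter as many groupings as the one
before it, then six resolutions cost 1 + 1/4 + ... + (1/4)^5 times
as much as the finest resolution alone, which can never exceed 4/3
however many resolutions are added. A minimal check, using a
hypothetical function name:

```python
def relative_cost(levels, ratio=0.25):
    """Cost of examining `levels` resolutions, relative to the finest alone,
    when each coarser level needs `ratio` times as many groupings."""
    return sum(ratio ** k for k in range(levels))


print(relative_cost(6))  # 1.3330078125
```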

However, regardless of these implementation considerations,
it seems likely that there is no alternative to examining
these large numbers of groupings if machine vision is
to rival human visual capabilities. As emphasized earlier,
there may be no information available at an early stage to
indicate which of a large number of possible groupings will
turn out to be meaningful. From what we know of the
human visual system, it seems that our brains have opted
for a brute-force, parallel approach to carrying out these
computations. A surprisingly large proportion of the human
brain seems to be devoted to simply computing local results
over all positions in visual images.

Summary 

Our derivation of the function of early vision does not
start from a specific model of what we expect to find in the
world. Given that the world is very general and variable,
any prior knowledge about its structure at this level would
probably be rather weak. Instead we see the task of early
vision to be the formation of meaningful groupings in the
image, where meaningfulness can be tested in a
domain-independent self-verifying way. It is possible to approach
this task as a signal detection problem which makes maximum
possible use of the statistical information in the image
to distinguish significant relationships from the background
of accidentals.

Therefore,  meaningfulness  docs  not  depend  on  prior 
world  knowledge.  For  example,  we  emphasize  the  detection 
of  linearity  as  one  useful  basis  not  because  it  is  a  common 
structure  in  the  world  so  much  as  because  it  is  a  particularly 
simple  basis,  where  simplicity  is  required  to  make  maximum 
use  of  the  available  information.  Therefore,  linearity  is  a 
useful way to segment and describe random fields (for the
purpose,  say,  of  comparing  them  to  similar  random  fields) 
even  though  the  fields  were  generated  without  respect  to  any 
linearities.  Although  we  cannot  claim  to  have  fully  done  so 
in  this  paper,  it  seems  possible  to  base  this  method  on  a 
sound  mathematical  derivation. 

One  result  of  the  domain-independence  assumption  is 
that  control  issues  become  less  important.  We  compute 
all  possibilities  in  a  bottom-up  exhaustive  manner,  and 
have  given  reasons  why  there  may  be  no  computationally 
less-expensive way to achieve these results. For example,
there  is  probably  nothing  to  gain  from  the  often  mentioned 
technique  of  detecting  the  strong  parts  of  an  edge  and 
“extending”  these  segments  to  look  for  weaker  segments, 
since  we  also  want  to  be  able  to  detect  the  edge  when  all 
parts are locally weak. Further progress in low-level perception
is more likely to come from attempts to precisely
specify what we want to measure than from improvements
in control of the computation.

Of  course,  the  above  is  not  meant  to  imply  that  all 
aspects  of  early  vision  can  be  derived  from  a  few  abstract 
principles.  Fortunately,  we  are  working  in  an  area  in  which 
it  is  comparatively  easy  to  perform  experiments  to  test  the 
capabilities  and  parameters  of  the  human  visual  system. 
The  measurement  of  meaningfulness  we  have  described  is 
a  theory  that  can  be  subjected  to  empirical  tests  and  can 
be  used  to  guide  experimentation.  One  encouraging  result 
is  that  a  few  fairly  simple  computations  seem  to  cover  what 
initially appear to be a very diverse set of capabilities. There
are  numerous  avenues  along  which  to  pursue  further  re¬ 
search. 
Abstract

We describe the current state of the 3D Mosaic project, whose
goal is to incrementally acquire a 3D model of a complex urban
scene from images. The notion of incremental acquisition arises
from the observations that (1) single images contain only partial
information about a scene, (2) complex images are difficult to fully
interpret,  and  (3)  different  features  of  a  given  scene  tend  to  be 
easier  to  extract  in  different  images  because  of  differences  in 
viewpoint  and  lighting  conditions.  In  our  approach,  multiple  images 
of  the  scene  are  sequentially  analyzed  so  as  to  incrementally 
construct  the  model.  Each  new  image  provides  information  which 
refines  the  model.  We  describe  some  experiments  toward  this  end. 
Our  method  of  extracting  3D  shape  information  from  the  images  is 
stereo  analysis.  Because  we  are  dealing  with  urban  scenes,  a 
junction-based  matching  technique  proves  very  useful.  This 
technique produces rather sparse wire frame descriptions of the
scene. A reasoning system that relies on task-specific knowledge
generates an approximate model of the scene from the stereo
output. Gray scale information is also acquired for the faces in the
model.  Finally,  we  describe  an  experiment  in  combining  two  views 
of  the  scene  to  obtain  a  refined  model. 

1. Introduction

The goal of the 3D Mosaic project is to automatically acquire a
detailed 3D description (or model) of a complex urban scene from
images. We are currently working with aerial photographs of
Washington, D.C. Fig. 1 shows a stereo pair of images from our
database. 

Our  approach  to  this  problem  is  based  on  the  notion  of 
incremental  acquisition  of  the  scene  model.  A  single  image  or  view 
of a complex scene is generally not adequate for deriving a
complete, accurate description of the scene. Some reasons for this
are:


1. Many surfaces in the scene are occluded in any
particular  view. 

2.  Because  of  the  complexity  of  an  image,  it  would  be 
difficult  to  interpret  all  the  detailed  parts. 

3.  Some  characteristics  of  visible  surfaces  may  not  be  as 
apparent  in  one  image  as  in  a  different  image.  For 
example,  it  may  be  difficult  to  analyze  a  highly  oblique 
surface  because  of  lack  of  resolution  in  the  image,  or  it 
may  be  difficult  to  analyze  surfaces  with  shadows  cast 
across  them. 

4.  Errors  in  analyzing  and  interpreting  the  image  may 
create  errors  and  inconsistencies  in  the  scene 
description. 

Our  method  involves  using  multiple  views  of  the  scene  in  a 
sequential  manner.  A  partial  description  is  derived  from  each  view. 
As  each  successive  view  is  analyzed,  the  model  of  the  scene  is 
incrementally  updated  with  information  derived  from  the  view.  The 
model  is  initially  an  approximation  of  the  scene,  and  becomes  more 
and  more  refined  as  new  views  are  processed.  At  any  point  along 
its  development,  the  model  should  be  usable  for  the  following  types 
of  tasks: 

1. When information is derived from a new view, it must be
matched  to  the  model  so  that  updating  can  occur.  The 
model  should,  therefore,  contain  information  that 
facilitates  this  matching. 

2.  The  model  should  permit  higher-level  components  to 
determine  which  parts  of  the  scene  should  be  analyzed 
in  more  detail,  and  whether  a  different  view  is  required 
for further analysis of these parts.

3.  The  model  should  be  usable  in  its  task  domain,  e.g.  for 
photointerpretation  or  display  generation. 

In  our  approach,  3D  features  that  are  relatively  inexpensive  to 
obtain, such as certain corners of buildings, are extracted from the
images.  A  model  of  the  scene  is  then  hypothesized  from  these 
features  by  utilizing  task-specific  knowledge  (e.g.  block-shaped 
objects in an urban scene). Updating and refinement of the model
is facilitated by remembering which parts of the model have been
hypothesized and which parts have been directly derived from the
images.

There  are  several  applications  we  have  in  mind  for  the  types  of 
models  that  are  acquired.  The  first  involves  model-based 
photointerpretation.  A  scene  model  can  provide  significant  help  in 
interpreting  images  of  the  scene  taken  from  arbitrary  viewpoints 
[4, 15].  Furthermore,  the  analysis  results  can  be  used  in  the 
incremental  acquisition  loop  to  update  and  refine  the  model. 
Another area of application deals with generating flight plans
(simulating the appearance of the scene along potential flight
paths) or familiarizing personnel with a given area. Our methods
provide  the  ability  to  acquire  a  model  of  a  scene  from  only  a  few 
views  and  then  generate  arbitrary  views  from  the  model.  Finally, 
our  incremental  3D  Mosaic  approach  should  be  applicable  to 
robot  navigation  and  manipulation  tasks.  The  ability  to 
incrementally  acquire  approximate  descriptions  of  complex 
environments  could  prove  useful  for  these  tasks,  since  these 
descriptions  may  then  be  used  to  make  decisions  dealing  with  path 
planning  or  determining  which  parts  of  the  environment  to  analyze 
in  more  detail. 

In the rest of this paper, we first discuss how to extract a 3D
scene description from a single view. The stereo pair of images
shown in Fig. 1 constitutes the single view to be considered.
Afterward,  we  discuss  combining  information  from  multiple  views. 

2.  Stereo  Analysis 

Our current method of extracting 3D shape information from the
images  is  via  stereo  analysis.  In  the  future,  we  may  add  other 
methods,  such  as  shadow  analysis  [16]. 

Our  approach  to  the  stereo  matching  problem  is  to  match 
junctions and lines found in the images. There are several reasons
for  this: 

1. Our goal is to recover the 3D structure in the scene. We
approach this problem by first extracting 3D
information dealing with vertices and edges in the
scene. In an urban scene, vertices often correspond to
corners of buildings. Therefore, by recovering scene
vertices and edges that emanate from them, we obtain
portions of boundaries of the buildings. These
boundaries then allow us to construct 3D
approximations of the buildings. (See [10] for a
different approach developed for the same task
domain.)


2.  Our  stereo  images  are  fairly  wide  angle  and  the  scene 
consists  of  tall  buildings.  As  a  result,  there  are  large 
discontinuities  in  disparity  and  the  appearance  of  many 
objects differs significantly in the two images. This has
caused  problems  for  most  previous  stereo  matching 
techniques  since  there  are  large  portions  of  the  scene 
that  are  visible  in  one  image  but  not  in  the  other.  In  our 
approach,  we  are  not  interested  in  matching  scene 
faces that are occluded in one of the image pairs.
Rather,  our  goal  is  to  match  face  boundaries  that  are 
visible  in  both  images.  We  do  this  by  explicitly  taking 
into  account  the  way  junctions  change  from  one  image 
to  the  other.  We  find  a  junction  in  one  image,  use  task- 
specific  constraints  to  predict  its  appearance  in  the 
other  image,  and  search  for  the  corresponding  junction 
by  making  use  of  the  predicted  appearance.  Currently, 
our  method  of  predicting  junction  appearances  is 
based on the following task-specific knowledge:

a.  In  aerial  photography,  image  planes  tend  to  be 
almost  parallel  to  the  ground  plane. 

b.  In  urban  scenes,  roofs  of  buildings  tend  to  be 
almost  parallel  to  the  ground  plane,  while  walls 
tend  to  be  perpendicular  to  this  plane. 

c. Therefore, features lying on a roof or on the
ground  will  maintain  the  same  shape  in  both 
images.  Edges  in  the  scene  that  are 
perpendicular  to  the  ground  plane  will  appear  in 
each  image  to  be  directed  toward  the  origin, 
defined  by  the  intersection  of  the  camera  axis 
with  the  image  plane  [11]. 

If  an  L  junction  is  found  in  one  image,  it  is  initially 
assumed to arise from a corner of a roof, and thus its
appearance  in  the  other  image  can  be  predicted.  If  an 
ARROW  or  FORK  junction  is  found,  the  line  directed 
toward  the  origin  is  initially  assumed  to  arise  from  a 
scene  edge  which  is  perpendicular  to  the  ground,  while 
the  other  two  lines  of  the  junction  are  initially  assumed 
to  arise  from  scene  edges  lying  on  a  roof  or  on  the 
ground.  Again,  its  appearance  can  be  predicted. 

3.  Many  stereo  systems  have  trouble  with  wide  angle 
stereo  images  because  they  rely  heavily  on  local 
similarities in the two images [2, 3, 9, 12, 13]. In our
approach,  however,  because  the  junction  is  intended  to 
represent  a  structural  component  in  the  scene,  we  also 
rely  on  more  global,  structural  similarities  in  the  two 
images  to  perform  the  matching. 

4.  For  a  scene  with  many  occlusion  boundaries,  an 
approach  based  on  feature  matching  results  in  much 
more  accurate  3D  positions  for  these  boundaries  than 
an  approach  based  on  gray  scale  area  matching. 
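
As an illustration of constraint (c) above, the following minimal sketch tests whether an image line is directed toward the origin (the intersection of the camera axis with the image plane), which is how a candidate vertical scene edge would be recognized. The function name, the angular tolerance, and the choice of measuring the direction from the segment midpoint are assumptions made for this example and are not taken from the system described here.

```python
import math


def directed_toward_origin(p, q, origin=(0.0, 0.0), tol_deg=5.0):
    """Return True if the image segment p-q points toward `origin`.

    The test compares the segment direction with the direction from the
    segment midpoint to the origin. The tolerance is a hypothetical value.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    mx, my = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    ox, oy = origin[0] - mx, origin[1] - my
    seg_len = math.hypot(dx, dy)
    rad_len = math.hypot(ox, oy)
    if seg_len == 0.0 or rad_len == 0.0:
        return False
    # A line has no preferred direction, so the sign of the cosine is ignored.
    cos_angle = abs(dx * ox + dy * oy) / (seg_len * rad_len)
    return cos_angle >= math.cos(math.radians(tol_deg))


radial = directed_toward_origin((10, 10), (20, 20))      # lies on a ray through the origin
tangential = directed_toward_origin((10, 10), (20, 10))  # does not
```

Lines that pass this test would be treated as candidate images of edges perpendicular to the ground plane; the remaining lines of a junction would be assumed to lie on a roof or on the ground.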

2.1. Steps in Stereo Analysis

Extracting lines. The first step in the stereo analysis is to extract
linear features. A 3x3 Sobel operator is used to extract edge points,
as shown in Fig. 2. Then the edges are thinned using a modified
Nevatia and Babu algorithm [14], as shown in Fig. 3. The resulting
edge points are linked and straight lines are fitted to them. The
method used to fit straight lines to a set of linked points is based on
iterative end-point fitting [7]. However, since this method
determines a line using only two end points, the line equation for
the  set  of  points  is  recalculated  using  least  squares.  Finally,  short 
lines  are  discarded.  The  resulting  line  images  are  shown  in  Fig.  4. 
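
The line-fitting step can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the deviation threshold is a hypothetical value, and the refit uses an orthogonal (total) least-squares line, which is one reasonable reading of "least squares" that also handles vertical lines.

```python
import math


def point_line_distance(pt, a, b):
    """Perpendicular distance from pt to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    length = math.hypot(x2 - x1, y2 - y1)
    if length == 0.0:
        return math.hypot(x - x1, y - y1)
    return abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1)) / length


def endpoint_fit(points, max_dev=1.5):
    """Iterative end-point fitting: split an ordered chain of linked edge
    points at the point farthest from the chord until every run is straight
    to within max_dev (an assumed threshold, in pixels)."""
    if len(points) <= 2:
        return [points]
    a, b = points[0], points[-1]
    dists = [point_line_distance(p, a, b) for p in points[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= max_dev:
        return [points]
    k = worst + 1  # index of the break point within `points`
    return endpoint_fit(points[:k + 1], max_dev) + endpoint_fit(points[k:], max_dev)


def least_squares_line(points):
    """Refit one run using all of its points: return the centroid and a unit
    direction vector of the orthogonal least-squares line."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), (math.cos(theta), math.sin(theta))


chain = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
runs = endpoint_fit(chain)            # splits the chain at the corner (3, 0)
lines = [least_squares_line(r) for r in runs]
```

Discarding short lines would then amount to filtering `runs` by length before the refit.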

Extracting  junctions.  The  next  step  is  to  extract  junctions  from 
the  line  images.  A  junction  is  a  group  of  lines  that  meet  at  a  point, 
and  often  arises  from  a  vertex  in  the  scene.  We  consider  the 
following  four  junction  types:  L,  ARROW,  FORK,  and  T.  To  find 
junctions,  a  5x5  window  around  each  end  point  of  each  line  is 
searched  for  ends  of  other  lines.  Lines  in  the  window  that  are  close 
and  nearly  parallel  are  combined  into  a  single  line.  Then,  if  the 
window  contains  the  ends  of  three  lines,  the  lines  are  classified  as 
an  ARROW,  FORK,  or  T  junction  depending  on  the  angles  between 
the  lines.  The  position  of  the  junction  point  is  the  middle  of  the 
three end points. If a window contains the ends of two lines, the
lines  are  classified  as  an  L  junction.  The  intersection  of  the  two 
lines  determines  the  position  of  the  junction  point.  If  a  window 
contains  more  than  three  lines,  each  set  of  two  lines  is  assumed  to 
form  a  distinct  L  junction.  Junctions  that  have  been  found  in  this 
manner are labeled in Fig. 4.
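
The text states only that three-line junctions are classified by the angles between the lines. One common convention, assumed here for illustration, looks at the largest angular gap between neighbouring lines around the junction point: a gap near 180 degrees gives a T, a larger gap gives an ARROW, and otherwise the junction is a FORK. The tolerance is a hypothetical parameter.

```python
import math


def classify_junction(center, far_ends, tol_deg=10.0):
    """Classify a junction from its point and the far end of each line."""
    if len(far_ends) == 2:
        return "L"
    if len(far_ends) != 3:
        raise ValueError("expected two or three lines")
    # Direction of each line as seen from the junction point, in [0, 360).
    angles = sorted(
        math.degrees(math.atan2(y - center[1], x - center[0])) % 360.0
        for x, y in far_ends
    )
    # Angular gaps between consecutive lines; they sum to 360 degrees.
    gaps = [
        angles[1] - angles[0],
        angles[2] - angles[1],
        360.0 - angles[2] + angles[0],
    ]
    largest = max(gaps)
    if abs(largest - 180.0) <= tol_deg:
        return "T"
    return "ARROW" if largest > 180.0 else "FORK"
```

For example, lines leaving the junction at 0, 45, and 315 degrees leave a 270 degree gap and are labeled ARROW, while lines at 0, 120, and 240 degrees are labeled FORK.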

Find  potential  junction  matches.  The  next  step  in  the  stereo 
analysis  is  to  match  the  junctions  found  in  one  image  with  those  in 
the  other.  Let  us  consider  how  L  junctions  are  matched.  As 
explained  previously,  each  L  junction  in  one  image  is  initially 
assumed  to  lie  on  a  plane  which  is  almost  parallel  to  the  camera 
image  planes.  The  shape  and  orientation  of  its  corresponding 
junction  in  the  other  image,  therefore,  can  be  predicted.  Each  L 
junction  in  the  first  image  may  be  matched  with  several  junctions  in 
the second image that lie along the corresponding epipolar line and
that  have,  within  tolerance,  the  predicted  shape  and  orientation.  An 
interesting  point  is  that  we  do  not  try  to  match  only  with  junctions  in 
the  second  image  that  have  been  previously  found.  Rather,  the 
shape  and  orientation  of  the  corresponding  junction  in  the  second 
image is predicted for every point lying on the epipolar line (on the
appropriate side of the infinity point), and at each of these points, a
search is made within a prespecified window for lines that might
correspond to the predicted junction. The requirements, however,
for two lines to be a junction are more relaxed than the requirements
during initial junction search. We therefore improve feature
detection in each image by using the features found in one image to
predict features in the other image. (Matching is performed in two
directions,  from  the  first  image  to  the  second,  and  vice  versa.) 


To  match  ARROW,  FORK,  and  T  junctions,  each  pair  of  lines 
forming  the  junction  is  treated  as  if  it  were  an  L  junction  and 
matched  in  the  manner  described  above. 

Search for unique junction matches. Next, a beam search
[15]  is  used  to  arrive  at  a  unique  combination  of  junction  matches. 
There  are  two  factors  involved  in  computing  costs  for  the  various 
combinations  of  matches: 

1. Local cost between two potentially matching junctions
is computed by the similarity of the image intensities
inside the junctions. The assumption here is that the
two junctions will have similar intensities if they arise
from the same face corner.

2.  Global  cost  is  based  on  the  consideration  that  if  there 
are  two  vertices  in  the  scene  with  the  same  heights,  the 
positional  relationship  between  their  corresponding 
junctions  in  one  image  will  be  the  same  as  in  the  other 
image.  This  is  due  to  the  image  planes  being  parallel  to 
the  ground  plane.  We  make  the  assumption  that 
junctions  which  are  close  to  one  another  will  often 
correspond  to  vertices  lying  on  top  of  the  same 
building,  thus  having  approximately  the  same  height. 

Global  cost  between  two  potentially  matching  junctions 
is  therefore  computed  by  the  similarity  of  the 
configuration  of  the  neighborhoods  around  the 
junctions. 

The  matching  procedure  is  applied  from  the  first  image  to  the 
second  and  vice  versa.  The  results  are  then  merged.  Fig.  5  shows 
junctions  and  lines  in  one  image  that  have  matches  in  the  other 
image. 
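
The search for a unique combination of matches can be illustrated with a small beam search. Everything specific in this sketch is an assumption: the data layout, the beam width, and in particular the global cost, which here simply penalizes neighbouring junctions whose displacements between the two images differ (a stand-in for the neighbourhood-configuration similarity described above).

```python
def beam_search_matches(candidates, pair_cost, beam_width=3):
    """candidates: dict mapping each first-image junction to a list of
    (second-image junction, local cost) pairs.
    pair_cost(j1, m1, j2, m2): global cost of accepting both matches.
    Returns (total cost, assignment) for the best combination kept."""
    beam = [(0.0, {})]
    for junction, options in candidates.items():
        expanded = []
        for cost, assignment in beam:
            for match, local in options:
                if match in assignment.values():
                    continue  # keep the matching one-to-one
                extra = sum(pair_cost(j, m, junction, match)
                            for j, m in assignment.items())
                new_assignment = dict(assignment)
                new_assignment[junction] = match
                expanded.append((cost + local + extra, new_assignment))
        if expanded:
            # Keep only the cheapest partial combinations.
            expanded.sort(key=lambda item: item[0])
            beam = expanded[:beam_width]
    return beam[0]


# Hypothetical junction positions in the first and second image.
left = {"a": (10, 10), "b": (14, 10)}
right = {"p": (13, 10), "q": (17, 10), "r": (30, 10)}
candidates = {"a": [("p", 0.2), ("r", 0.1)], "b": [("q", 0.2), ("r", 0.3)]}


def displacement_cost(j1, m1, j2, m2):
    """Vertices at the same height keep their relative position, so their
    junctions should shift by the same amount between the images."""
    d1 = (right[m1][0] - left[j1][0], right[m1][1] - left[j1][1])
    d2 = (right[m2][0] - left[j2][0], right[m2][1] - left[j2][1])
    return abs(d1[0] - d2[0]) + abs(d1[1] - d2[1])


total, assignment = beam_search_matches(candidates, displacement_cost)
```

In this toy example the locally cheapest match for junction `a` is `r`, but the global term makes the combination `a` to `p`, `b` to `q` the cheapest overall.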

Search for third legs of junctions. The next step is to find
lines  in  the  images  that  might  be  the  third  leg  of  matched  junctions 
and  that  might  represent  scene  edges  perpendicular  to  the  ground 
plane.  The  method  used  is  to  find  lines  near  the  junctions  in  both 
images  that  are  directed  toward  the  origin. 

Generate 3D wire frames. Finally, 3D coordinates are derived
using triangulation. Fig. 6 shows a perspective view of the 3D
vertices  and  edges  that  result.  We  call  this  a  wire  frame  description 
of  the  scene. 
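
The triangulation step is not detailed in the text. As a rough illustration only, the sketch below assumes the simplest stereo geometry: two cameras with the same focal length, coplanar image planes, and a known baseline along the x axis. Under those assumptions depth follows from disparity as Z = f B / d. The camera model actually used may well differ.

```python
def triangulate(x_left, x_right, y, focal_length, baseline):
    """Return (X, Y, Z) in the left camera frame for a matched point,
    assuming idealized parallel-axis stereo (an assumption of this sketch)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("matched point must have positive disparity")
    depth = focal_length * baseline / disparity
    return (x_left * depth / focal_length, y * depth / focal_length, depth)


vertex = triangulate(12.0, 10.0, 4.0, focal_length=100.0, baseline=50.0)
```

Applying this to both ends of each matched line would give the 3D vertices and edges of a wire frame.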

3.  Representing  and  Modifying  the  3D 
Scene  Model 

Our requirement that the scene model is to be incrementally
acquired leads to several issues: (1) representing partial constraints
on 3D structure, (2) incremental accumulation of these partial
constraints,  and  (3)  handling  discrepancies  in  information  acquired 
at  different  times. 
Our approach involves representing the model in a modular
manner.  Constraints  on  3D  structure  are  represented  in  the  form  of 
a  graph,  called  the  structure  graph.  The  nodes  and  links  represent 
primitive  topological  and  geometric  constraints.  The  structure 
graph  is  incrementally  constructed  through  the  addition  of  these 
constraints.  As  constraints  are  accumulated  in  the  graph,  their 
effects  are  propagated  to  other  parts  of  the  graph  so  as  to  obtain 
globally  consistent  interpretations. 

3.1.  Representation  of  Model 

The current structure-graph representation models surfaces in
the scene as polyhedra. The components of a polyhedral surface
are the face, edge, and vertex. We distinguish the topology of the
polyhedral components from their geometry [1, 8]. The geometry
involves the physical dimensions and location in 3-space of each
component, while the topology involves connections between the
components.

In  the  structure  graph,  nodes  represent  either  primitive 
topological  elements  -  faces,  edges,  vertices,  objects,  and  edge 
groups  (which  are  rings  of  edges  on  faces)  -  or  primitive  geometric 
elements - planes, lines, and points. Vertex, face, and edge nodes
are  tagged  as  either  confirmed  or  unconfirmed.  Confirmed  means 
that  the  element  represented  by  the  node  has  been  derived  directly 
from  the  images.  Unconfirmed  means  that  the  element  has  only 
been  hypothesized. 

The  primitive  geometric  elements  serve  to  constrain  the  3-space 
locations  of  faces,  edges,  and  vertices.  Plane  and  line  nodes 
contain  plane  and  line  equations,  respectively.  Point  nodes  contain 
coordinate  values.  The  graph  contains  two  types  of  links:  the 
part-of link, representing the part/whole relation between two
topological  nodes,  and  the  geometric  constraint  link,  representing 
the  constraint  relation  between  a  geometric  and  topological  node. 
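
The representation described above can be sketched as a small data structure. The class and field names are assumptions chosen for this illustration; only the node kinds, the confirmed/unconfirmed tag, and the two link types come from the description.

```python
from dataclasses import dataclass, field

TOPOLOGICAL = {"object", "face", "edge_group", "edge", "vertex"}
GEOMETRIC = {"plane", "line", "point"}


@dataclass
class Node:
    name: str
    kind: str                 # one of TOPOLOGICAL or GEOMETRIC
    confirmed: bool = False   # only meaningful for vertex, edge, and face nodes
    data: tuple = ()          # equation coefficients or point coordinates


@dataclass
class StructureGraph:
    nodes: dict = field(default_factory=dict)
    part_of: set = field(default_factory=set)      # (part, whole) pairs
    constrains: set = field(default_factory=set)   # (geometric, topological) pairs

    def add_node(self, name, kind, confirmed=False, data=()):
        if kind not in TOPOLOGICAL | GEOMETRIC:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = Node(name, kind, confirmed, tuple(data))

    def add_part_of(self, part, whole):
        # Part-of links join two topological nodes.
        if not {self.nodes[part].kind, self.nodes[whole].kind} <= TOPOLOGICAL:
            raise ValueError("part-of links join topological nodes")
        self.part_of.add((part, whole))

    def add_constraint(self, geometric, topological):
        # Constraint links run from a geometric node to a topological node.
        if self.nodes[geometric].kind not in GEOMETRIC:
            raise ValueError("first node must be geometric")
        if self.nodes[topological].kind not in TOPOLOGICAL:
            raise ValueError("second node must be topological")
        self.constrains.add((geometric, topological))


g = StructureGraph()
g.add_node("v1", "vertex", confirmed=True)   # derived directly from the images
g.add_node("e1", "edge")                     # hypothesized, hence unconfirmed
g.add_node("p1", "point", data=(12.0, 40.0, 35.0))
g.add_part_of("v1", "e1")
g.add_constraint("p1", "v1")
```

Keeping the two link types in separate sets makes it straightforward to follow part-of links when the effects of a constraint are propagated.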

3.2.  Modifications  to  Model 

Modifications  to  the  model  will  occur  as  part  of  the  process  of 
incremental  construction.  Deletions  and  changes  are  made  when 
new  information  is  found  to  conflict  with  information  currently 
contained  in  the  model.  This  will  happen  most  often  with  portions 
of  the  model  that  have  been  hypothesized.  Additions  to  the  model 
are  made  to  incorporate  the  new  information  as  part  of  the  model. 

Modifications  to  the  structure  graph  are  made  by  adding  or 
deleting  nodes  and  links,  or  changing  the  equations  of  line  and 
plane nodes, or the coordinates of point nodes. All effects of
modifications are propagated to other parts of the graph.

As an example, consider adding or deleting a geometric
constraint  link  between  a  geometric  and  topological  node.  Any  of 
the  three  geometric  nodes  -  points,  lines,  and  planes  -  may 
constrain  any  of  the  three  topological  nodes  -  vertices,  edges,  and 
faces. Fig. 7 shows how a constraint on one node may propagate
to  others.  The  arrows  in  the  figure  indicate  the  direction  of 
propagation.  For  example,  if  a  point  constrains  a  vertex,  it  must 
also  constrain  all  edges  and  faces  containing  that  vertex.  Similarly, 
a  point  that  constrains  an  edge  also  constrains  all  faces  containing 
that  edge. 
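
The propagation of a point constraint can be sketched as below. The sketch covers only the point case spelled out in the text (vertex to edges to faces); the rules for lines and planes depend on Fig. 7 and are not reproduced. The set-of-pairs representation and the function name are assumptions of this example.

```python
def propagate_point_constraint(constrains, part_of, point, node):
    """Record that `point` constrains `node` and everything containing it.

    constrains: set of (geometric, topological) pairs, modified in place.
    part_of: set of (part, whole) pairs between topological nodes.
    """
    pending = [node]
    while pending:
        current = pending.pop()
        if (point, current) in constrains:
            continue  # already recorded, nothing new to propagate
        constrains.add((point, current))
        # Follow part-of links upward: vertex -> edge -> face.
        pending.extend(whole for part, whole in part_of if part == current)
    return constrains


part_of = {("v", "e"), ("e", "f")}
links = propagate_point_constraint(set(), part_of, "p", "v")
```

Starting from the single assertion that p constrains v, the call also records that p constrains e and f.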

When  a  geometric  constraint  link  is  deleted,  the  rest  of  the 
structure  graph  must  be  made  consistent  with  this  change.  Our 
approach  to  this  problem  is  based  on  the  TMS  system  [6],  using  the 
notion that when an assertion is deleted, all assertions implying it
and  all  assertions  implied  by  it  should  also  be  deleted,  unless  they 
have  other  support.  Assertions  that  imply  a  given  assertion  are 
obtained  by  following  backwards  along  the  arrows  in  Fig.  7. 
Assertions  implied  by  a  given  assertion  involve  following  forward 
along  the  arrows. 

Consider  the  example  in  Fig.  8a,  which  depicts  three  topological 
nodes  (vertex  v,  edge  e,  face  f)  constrained  by  one  geometric  node 
(point p). Suppose now that link 4 is deleted (Fig. 8b), i.e., the
assertion  "p  constrains  e"  is  deleted.  To  find  the  assertion  that 
might  imply  this  one,  locate  the  box  in  Fig.  7  that  represents  a  point 
constraining  an  edge,  follow  backwards  along  the  arrow,  and  the 
result  is  the  box  that  represents  the  point  constraining  any  vertex  of 
the  edge.  In  Fig.  8b,  this  represents  the  assertion  "p  constrains  v, 
and  v  is  part  of  e".  This  assertion  must  therefore  be  made  false.  To 
do so, we may delete either link 1, link 3, or both from Fig. 8b. We
have arbitrarily decided that part-of links should dominate
constraint  links,  and  thus  link  3  is  deleted.  This  seems  to  work  well 
for  our  examples. 

We  now  must  determine  the  assertions  implied  by  the  one  initially 
deleted. We follow forward along the arrow from the box in Fig. 7
that represents a point constraining an edge, and the result is the
box that represents the point constraining all faces containing the
edge. In Fig. 8b, this represents the assertion "p constrains f",
which is link 5. This link should therefore be deleted because it has
no other support. The resulting structure graph is depicted in Fig.
8c.
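
The deletion example can be traced with a small sketch. It is a simplification under stated assumptions: constraints and part-of links are stored as sets of pairs, every constraint is treated as derivable (no distinction is made between premises and consequences), and only point constraints are handled. It is not the TMS of [6], only an imitation of the behaviour described for Fig. 8.

```python
def delete_constraint(constrains, part_of, geom, node):
    """Retract "geom constrains node" and restore consistency in place."""
    constrains.discard((geom, node))
    parts = [p for p, w in part_of if w == node]
    wholes = [w for p, w in part_of if p == node]
    # Backward: "geom constrains part, and part is part of node" would imply
    # the deleted assertion. Part-of links dominate, so the constraint link
    # on the part is the one retracted.
    for part in parts:
        if (geom, part) in constrains:
            delete_constraint(constrains, part_of, geom, part)
    # Forward: an implied constraint on a containing node is retracted
    # unless some other part of that node still supports it.
    for whole in wholes:
        if (geom, whole) not in constrains:
            continue
        supported = any((geom, p) in constrains
                        for p, w in part_of if w == whole)
        if not supported:
            delete_constraint(constrains, part_of, geom, whole)


# The situation of Fig. 8a: p constrains v, e, and f.
part_of = {("v", "e"), ("e", "f")}
constrains = {("p", "v"), ("p", "e"), ("p", "f")}
delete_constraint(constrains, part_of, "p", "e")   # delete "p constrains e"
```

After the call, all three constraint links are gone and both part-of links remain, which agrees with the outcome described for the example.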
4. Generating the 3D Scene Model

We  now  present  an  example  showing  how  the  scene  model  is 
generated  from  the  output  of  the  stereo  analysis  component.  We 
start with the 3D wire-frame description shown in Fig. 9. The final
model  derived  is  a  surface-based  description. 

Combine  edges.  First,  if  there  are  two  wire  frame  edges  that  are 
nearly  parallel  and  very  close  to  each  other,  they  are  merged  into  a 
single  edge.  This  occurs  only  once  in  Fig.  9,  for  the  two  edges 
labeled E1 and E2.
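
A minimal sketch of the edge-combination test follows. The text gives no thresholds or exact criteria, so the angle and distance limits, the use of point-to-line distance as the closeness measure, and the way the merged edge is formed are all assumptions of this example.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def point_to_line(p, a, b):
    """Distance from 3D point p to the infinite line through a and b."""
    d = _sub(b, a)
    t = _dot(_sub(p, a), d) / _dot(d, d)
    foot = tuple(ai + t * di for ai, di in zip(a, d))
    diff = _sub(p, foot)
    return math.sqrt(_dot(diff, diff))


def should_merge(e1, e2, max_angle_deg=5.0, max_dist=1.0):
    """True if the edges are nearly parallel and e2 lies close to e1."""
    d1, d2 = _sub(e1[1], e1[0]), _sub(e2[1], e2[0])
    cos_angle = abs(_dot(d1, d2)) / math.sqrt(_dot(d1, d1) * _dot(d2, d2))
    parallel = cos_angle >= math.cos(math.radians(max_angle_deg))
    close = all(point_to_line(p, e1[0], e1[1]) <= max_dist for p in e2)
    return parallel and close


def merge_edges(e1, e2):
    """Replace two edges by the segment joining their extreme end points,
    ordered along the direction of e1."""
    d = _sub(e1[1], e1[0])
    pts = sorted(list(e1) + list(e2), key=lambda p: _dot(_sub(p, e1[0]), d))
    return (pts[0], pts[-1])


e1 = ((0.0, 0.0, 10.0), (10.0, 0.0, 10.0))
e2 = ((4.0, 0.2, 10.0), (15.0, 0.3, 10.0))
merged = merge_edges(e1, e2) if should_merge(e1, e2) else None
```

Note that this closeness test ignores gaps along the line direction; a fuller version would also bound the separation between collinear edges.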

Generate  web  faces.  Next,  each  vertex  is  assumed  to 
correspond to a corner of an object. Therefore each adjacent pair
of legs ordered around the vertex corresponds to the corner of a
planar face. Thus far in our experiments, we have dealt only with
trihedral vertices. In this case, every pair of legs of each vertex
corresponds to the corner of a separate face. A partial face, called
a web face, is generated for each such pair.

Merge partial faces. After all web faces have been created,
those that represent the corners of a single face are merged. Two
partial faces that contact each other (e.g. F1 and F2 in Fig. 9)
should be merged if (1) they share exactly one edge, (2) the edge
serves as a boundary of both faces, but does not partition them,
and (3) the planes of the faces are nearly parallel and very close to
each other.

Two  partial  faces  that  do  not  contact  each  other  (e.g.  F3  and  F4 
in  Fig.  9)  should  be  merged  if  (1)  each  face  has  a  single  chain  of 
edges  that  is  not  closed,  (2)  each  of  the  two  end  points  of  the  edge 
chain  of  one  face  must  be  uniquely  matched  with  those  of  the  other 
face,  where  unique  matching  is  determined  by  the  distance 
between  the  two  points  being  less  than  a  threshold,  and  (3)  the 
planes  of  the  faces  are  nearly  parallel  and  very  close.  When 
merging  the  two  non-contacting  faces,  the  two  edges  on  which 
each  matching  pair  of  end  points  lie  are  extended  in  space  and 
intersected.  The  intersection  points  form  two  new  vertices  on  the 
resulting face.

Complete the shapes of faces. After all mergers have been
performed, many faces may still be incomplete, i.e., they do not
have a closed boundary. In these cases, task-specific knowledge is
used to hypothesize the shape of each face, and it is completed by
generating the appropriate edges and vertices. The rules used
here are:


1. If the partial face consists of a single corner, i.e., it
contains  only  two  connected  edges,  the  shape  is 
completed  as  a  parallelogram. 

2.  If  the  partial  face  contains  three  or  more  edges 
connected  as  a  single  chain,  the  shape  is  completed  by 
connecting  the  two  end  points  of  the  chain  with  a  new 
edge. 
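
The two completion rules reduce to simple vector arithmetic, sketched below with hypothetical function names. For rule 1, if the corner vertex is c and its two edges end at a and b, the hypothesized fourth vertex of the parallelogram is a + b - c. For rule 2, the new edge joins the two ends of the chain.

```python
def complete_parallelogram(end_a, corner, end_b):
    """Rule 1: return the fourth vertex of the parallelogram spanned by
    the edges corner-end_a and corner-end_b."""
    return tuple(a + b - c for a, b, c in zip(end_a, end_b, corner))


def closing_edge(chain):
    """Rule 2: return the new edge that joins the two end points of an
    open chain of three or more connected edges."""
    return (chain[-1], chain[0])


fourth = complete_parallelogram((4, 0, 10), (0, 0, 10), (0, 3, 10))
closing = closing_edge([(0, 0, 10), (4, 0, 10), (5, 3, 10), (1, 4, 10)])
```

The vertices and edges generated this way would be tagged as unconfirmed in the structure graph, since they are hypothesized rather than derived from the images.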

Find holes in the faces. After all faces have been completed,
one face is assumed to represent a hole in another face if (1) the
planes of the faces are nearly parallel and close to each other, and
(2)  the  boundary  of  the  first  face,  when  projected  onto  the  plane  of 
the  second  face,  falls  inside  the  boundary  of  that  face.  When  these 
conditions  are  met,  the  bounding  edges  of  the  first  face  are 
converted  into  an  inner  ring  of  edges  of  the  second  face. 
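
The hole test can be sketched as follows. To keep the example short it assumes that the boundary of the first face has already been projected onto the plane of the second face and that both boundaries are given in 2D coordinates of that plane. The plane representation, the thresholds, and the vertex-only containment check (sufficient for convex outer faces, not for all concave ones) are further assumptions.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def planes_agree(plane_a, plane_b, max_angle_deg=5.0, max_offset=1.0):
    """Each plane is (unit normal n, offset d), meaning n . x = d."""
    (na, da), (nb, db) = plane_a, plane_b
    cos_angle = sum(a * b for a, b in zip(na, nb))
    if cos_angle < 0:  # same plane written with the opposite orientation
        cos_angle, db = -cos_angle, -db
    return (cos_angle >= math.cos(math.radians(max_angle_deg))
            and abs(da - db) <= max_offset)


def is_hole(inner_xy, inner_plane, outer_xy, outer_plane):
    """Conditions (1) and (2) for treating the inner face as a hole."""
    return (planes_agree(inner_plane, outer_plane)
            and all(point_in_polygon(p, outer_xy) for p in inner_xy))


outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(3, 3), (6, 3), (6, 6), (3, 6)]
roof_plane = ((0, 0, 1), 20.0)
inner_plane = ((0, 0, 1), 19.8)
found = is_hole(inner, inner_plane, outer, roof_plane)
```

When the test succeeds, the edges of the inner face would be re-attached to the outer face as an inner edge group.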

Generate vertical faces for incomplete objects. At this
point, many objects will be only partially complete because they are
not closed. Task-specific knowledge may be used to add more
faces to the object. Because our 3D information is obtained from
aerial images of an urban scene, many faces that lie high enough
above the ground plane represent roofs of buildings. For each
such face, vertical walls are dropped toward the ground plane from
each edge of the face, unless the edge is also part of another face.
The equation of the ground plane is currently interactively obtained.
A vertical wall is dropped either down to the ground plane, or to the
first face it intersects on the way down.

Our  procedure  for  dropping  vertical  faces  from  a  face  F  is  as 
follows.  First,  an  edge  is  dropped  from  each  vertex  of  F  either  to 
the  ground  plane  or  to  the  first  face  it  intersects.  Next,  web  faces 
are created for each new edge pair at each vertex. Newly created
faces are then merged and completed in the ways described above.
Fig. 10 shows several perspective views of the resulting scene
model. 
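
Dropping walls from a roof face can be sketched as below. The example assumes that "vertical" means the z direction and that the ground plane is given as coefficients (a, b, c, d) of a x + b y + c z = d. It handles only walls that reach the ground; stopping at the first intersected face is omitted. All names are hypothetical.

```python
def drop_to_ground(vertex, ground):
    """Move a vertex straight down (along z) onto the ground plane."""
    x, y, _ = vertex
    a, b, c, d = ground
    if c == 0:
        raise ValueError("ground plane must not be vertical")
    return (x, y, (d - a * x - b * y) / c)


def vertical_walls(roof, ground, shared_edges=()):
    """Return one quadrilateral wall per roof edge.

    roof: list of 3D vertices in order around the face.
    shared_edges: indices i of edges (roof[i], roof[i+1]) that already
    belong to another face and therefore get no wall.
    """
    walls = []
    n = len(roof)
    for i in range(n):
        if i in shared_edges:
            continue
        top_a, top_b = roof[i], roof[(i + 1) % n]
        walls.append([top_a, top_b,
                      drop_to_ground(top_b, ground),
                      drop_to_ground(top_a, ground)])
    return walls


roof = [(0, 0, 30), (10, 0, 30), (10, 8, 30), (0, 8, 30)]
ground = (0, 0, 1, 0)   # the plane z = 0
walls = vertical_walls(roof, ground)
```

This builds each wall directly as a quadrilateral, which is a shortcut; the procedure described above instead drops edges from the vertices and then reuses the web-face generation, merging, and completion steps.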

4.1.  Comparison  with  Depth  Map 

There  are  several  interesting  points  about  the  generated  model. 
First, notice that it is a higher level description than a depth map.
The product of most stereo analysis systems is a depth map [2, 13]
which, like an image, is an array of numbers that requires
description. Our approach, on the other hand, has been to extract
a  sparse  amount  of  3D  information  using  stereo  analysis  (as  shown 
in  Fig.  0)  and  to  use  task-specific  knowledge  to  go  directly  to  a 
higher level 3D description. This description is much more
compact than one based on surface points, and allows properties
such as topology, shape, absolute size, and absolute position of
scene objects to be easily available. It should therefore be easier to
update  and  refine  the  model  from  information  obtained  from 
subsequent  views.  Furthermore,  the  model  should  be  more  useful 
for matching with 2D image information, with 3D information
extracted  from  images,  and  with  other  models. 

4.2.  Mapping  Gray  Scale  onto  Faces 

In  order  to  render  more  realistic  displays,  gray  scale  is  added  to 
them  [5].  This  is  accomplished  by  associating  with  each  face  in  the 
model  a  normalized  intensity  image  patch  of  the  face.  Although 
these patches are currently derived from a single image of the
scene,  we  plan  to  generate  them  from  multiple  images.  Geometric 
normalization,  which  eliminates  the  effects  of  perspective 
projection,  is  performed  on  the  patches.  We  also  hope  to  perform 
photometric  normalization  to  eliminate  the  effects  of  varying 
illumination conditions. Fig. 11 shows the results of adding gray
scale  to  the  faces  of  the  model.  On  a  color  display,  faces  and  parts 
of  faces  that  are  occluded  in  the  original  image  are  displayed  in  red. 
An  interesting  future  problem  involves  incrementally  updating  the 
intensity  patch  of  a  face  as  information  is  acquired  from  successive 
images. Note that the gray scale displays might also be useful in
performing a 2D match between the projected image of the model
and an image of the real scene.

5.  Multiple  Views 

This  section  describes  an  experiment  in  combining  information 
from  two  views  to  generate  the  scene  description.  The  3D 
information  shown  in  Fig.  9  is  derived  from  one  view  (viewing  the 
scene from the "front"). Another set of vertices and edges,
depicted in Fig. 12, was manually generated to simulate the
information available from an opposing point of view (viewing the
scene from the "back"). The viewpoint for the perspective
drawings of Figs. 9 and 12 is almost the same to allow better
comparison by the reader. Notice that the information in Fig.
9 emphasizes edges and vertices that are facing the front of the
scene, while vertices and edges facing the back of the scene are
emphasized in Fig. 12.

We have made the assumption in this experiment that we know
the exact positions, relative to the first view, of the cameras used to
obtain the second view. Therefore, the wire-frame descriptions in
Figs. 9 and 12 can be expressed in the same coordinate system.
We are currently working on the problem of matching such
descriptions with a model so that relative positions of views can be
automatically determined.


The procedure used in this experiment is similar to the one
described in the last section, except that matching and merging of
the two sets of wire frames is also required.

First, for each set of wire frames, edges that are nearly parallel
and very close to each other are merged. Next, each connected
group of edges is labeled as a separate wire-frame object. We now
want to merge objects derived from the first view with matching
objects derived from the second view. Two objects are said to
match if they have matching vertices or edges. The requirements
for two vertices, one from each object, to match are: (1) they must
be very close together, or (2) they must be part of matching edges,
and the other two vertices of the edges match. The requirements
for two edges, one from each object, to match are: (1) the two
vertices of one edge must match the two of the other, or (2) one
vertex of one edge matches one vertex of the other, and the two
edges are close together and overlap in their lengths. These rules
are used in a relaxation algorithm to obtain matching vertices and
edges.
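The geometric part of edge rule (2), "close together and overlap in their lengths", can be made concrete in several ways; the text does not say which one is used. The following is a minimal sketch under assumed definitions: two edges are "close" if both end points of one lie within a tolerance `tol` of the line through the other, and they "overlap" if their projections onto that line share an interval. The function name, the tolerance, and both definitions are illustrative assumptions, not the authors' implementation.

```python
import math


def _sub(p, q):
    return tuple(x - y for x, y in zip(p, q))


def _dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def edges_close_and_overlap(a0, a1, b0, b1, tol=0.5):
    """Assumed test for edge rule (2): edge b lies near edge a and overlaps it.

    a0, a1, b0, b1 are 3-space end points given as (x, y, z) tuples.
    """
    d = _sub(a1, a0)
    length_sq = _dot(d, d)
    if length_sq == 0.0:
        return False  # degenerate edge a
    params = []
    for p in (b0, b1):
        # Position of p along edge a; the interval [0, 1] spans edge a.
        t = _dot(_sub(p, a0), d) / length_sq
        foot = tuple(a + t * c for a, c in zip(a0, d))
        if math.dist(p, foot) > tol:
            return False  # end point too far from the line through a
        params.append(t)
    lo, hi = min(params), max(params)
    # The projected interval of b must share some length with [0, 1].
    return hi > 0.0 and lo < 1.0
```

A predicate of this kind would be evaluated inside the relaxation loop each time a candidate pair of edges shares one matched vertex.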

Two  matching  wire-frame  objects  are  merged  in  the  following 
manner.  First,  their  matching  vertices  are  merged.  The 
coordinates  of  each  resulting  vertex  are  those  of  the  midpoint  of  the 
line  connecting  the  two  initial  vertices.  Next,  the  matching  edges 
are  merged  by  using  a  type  of  "averaging"  to  obtain  a  resulting 
edge  for  two  initial  edges  that  do  not  coincide.  This  averaging  is 
based on the observation that end points of edges that are vertices
generally  have  much  more  accurate  3-space  positions  than  end 
points  that  are  not  vertices.  Therefore,  the  vertex  end  points  are 
given  greater  weight  in  the  averaging  than  the  non-vertex  end 
points.  Finally,  all  other  edges  and  vertices  of  the  two  objects  are 
combined  to  generate  a  single  wire  frame  object. 
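The merging step can be sketched as follows. Vertex merging by midpoint is stated in the text; the weighting used for edge end points is not, so the weights below (`vertex_weight`, `free_weight`) and the function names are purely illustrative assumptions.

```python
def merge_vertices(v1, v2):
    """Midpoint of the line connecting two matching vertices (as in the text)."""
    return tuple((a + b) / 2.0 for a, b in zip(v1, v2))


def merge_end_points(p, p_is_vertex, q, q_is_vertex,
                     vertex_weight=3.0, free_weight=1.0):
    """Weighted average of two corresponding edge end points.

    End points that are vertices get the larger (assumed) weight because
    their 3-space positions are generally more accurate.
    """
    wp = vertex_weight if p_is_vertex else free_weight
    wq = vertex_weight if q_is_vertex else free_weight
    return tuple((wp * a + wq * b) / (wp + wq) for a, b in zip(p, q))
```

When both end points are vertices, or neither is, the weighted average reduces to the plain midpoint.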

From this point onward, processing continues as described in the
previous section. Web faces are generated for each corner of each
vertex, the web faces are merged, the shapes of incomplete faces
are completed, holes in faces are found, and vertical walls are
dropped from faces floating above the ground. Fig. 13 shows
several perspective views of the resulting scene model.

5.1. Results with Multiple Views

There are two important differences between the scene models
shown in Figs. 13 and 10. First, the one in Fig. 13 contains more
buildings. This is expected because more wire-frame data is
available in constructing this model. Second, many buildings that
are described in both models are more accurately described in the
one in Fig. 13. That is, the positions of vertices and edges of these
buildings are more precise. There are two reasons for this: (1)
Since more wire-frame data is available for reconstructing these
buildings, we obtain high accuracy for more vertices and edges. (2)
Since many vertices and edges are redundantly available in both
sets of data, their positions are "averaged", generally decreasing
the amount of error.

This  experiment  demonstrates  how  the  information  provided  by 
each  additional  view  allows  the  scene  model  to  be  gradually  refined 
and  made  more  complete. 

In this experiment, only wire-frame objects are matched and
merged. Our next step will be to match and merge a wire-frame
description with a scene model. Our current experiment can also
be thought of as merging wire frames with a scene model by noting
that it is equivalent to having generated a model from one set of
wire frames, but using only confirmed vertices and edges of the
model to match and merge with the other set of wire frames. This
gives an indication of the importance of confirmed information for
the more general matching and merging processes. Our next step
will require determining which parts of the model, both confirmed
and unconfirmed, require modification. Some of these parts may
actually have to be pulled apart and rebuilt, while others may merely
require modifications to their 3-space locations.

6.  Conclusion 

The current state of the 3D Mosaic project has been described.
The goal of this project is to acquire a detailed 3D model of a
complex scene from images. A useful approach to this problem is
to acquire the model in an incremental manner, over a sequence of
images taken from multiple viewpoints. We have also shown that
task-specific knowledge is very useful in interpreting complex
images. Our stereo analysis component uses such knowledge for
matching features in the images, and our higher level reasoning
component uses such knowledge for reconstructing shapes from
the stereo output.

Fig. 14 displays a flow chart for the whole system. The stereo
analysis extracts 3D wire-frame descriptions representing portions
of boundaries of the buildings in the scene. A surface-based model
representing an approximation of the scene is then generated from
the wire-frame descriptions. This model should be useful for tasks
such as matching, photointerpretation, display generation, and
path planning. On a color display, the images in Fig. 11 would show
red for parts of the scene not yet observed. This idea can be used
in a task such as planning flight paths for reconnaissance, where a
path that permits viewing the maximum amount of red portions
might be optimal.

There are several extensions and improvements we have in mind
for our system. In addition to continuing our experiments with
multiple views as discussed in the previous section, the following
are our main tasks for the immediate future:

1. Using the scene model for matching. This is required
for performing model-based image understanding and
for updating the model with information obtained from
a new view.

2. Verifying a scene model in a top-down manner by
projecting hypothesized edges and vertices into the
image plane and then searching for them in the image.

3. Increasing the amount and accuracy of the wire-frame
information extracted during stereo analysis. More
boundaries of buildings in the scene than shown in Fig. 6
can probably be extracted by directly incorporating
task-specific knowledge at the lowest levels in the
process of extracting junctions from the image.

Acknowledgement 

Fumi  Komura  did  much  work  in  exploring  and  experimenting  with 
initial  concepts  dealing  with  this  project.  Duane  Williams  has 
provided  excellent  programming  support  and  many  ideas.  In 
addition,  Dave  McKeown,  Steve  Shafer,  and  David  Smith  have 
provided  useful  comments  and  criticism. 


References 

1. Baer, A., Eastman, C., and Henrion, M. "Geometric Modelling: a
Survey." Computer-Aided Design 11 (September 1979).

2. Baker, H. H., and Binford, T. O. "Depth from Edge and Intensity
Based Stereo." Proc. IJCAI-81 (1981).

3. Barnard, S. T. and Thompson, W. B. "Disparity Analysis of
Images." IEEE Trans. on Pattern Analysis and Machine Intelligence
PAMI-2, 4 (July 1980).

4. Barrow, H. G., Bolles, R. C., Garvey, T. D., Kremers, J. H.,
Tenenbaum, J. M., and Wolf, H. C. "Experiments in Map-guided
Photo Interpretation." Proc. IJCAI-77 (August 1977).

5. Devich, R. N., and Weinhaus, F. M. "Image Perspective
Transformations." Proc. SPIE (July 1980).

6. Doyle, J. "A Truth Maintenance System." Artificial Intelligence
12 (1979), 231-272.

7. Duda, R. O. and Hart, P. E. Pattern Classification and Scene
Analysis. John Wiley and Sons, New York, 1973.

8. Eastman, C. M., and Preiss, K. A Unified View of Solid Shape
Modeling Based on Consistency Verification. Carnegie-Mellon
University, September, 1981.

9. Hannah, M. J. Computer Matching of Areas in Stereo Images.
Tech. Rept. AIM-239, Stanford University, July, 1974.

10. Henderson, R. L., Miller, W. J., and Grosch, C. B. "Automatic
Stereo Reconstruction of Man-made Targets." Proc. SPIE 186
(August 1979).

11. Liebes, S. "Geometric Constraints for Interpreting Images of
Common Structural Elements: Orthogonal Trihedral Vertices."
Proc. Image Understanding Workshop (April 1981).

12. Lucas, B. D., and Kanade, T. "An Iterative Image Registration
Technique With an Application to Stereo Vision." Proc. IJCAI-81
(August 1981).

13. Marr, D., and Poggio, T. "A Computational Theory of Human
Stereo Vision." Proc. R. Soc. Lond. B 204 (1979).

14. Nevatia, R. and Babu, K. R. An Edge Detection, Linking and
Line Finding Program. Image Processing Institute, University of
Southern California, September, 1978.

15. Rubin, S. The ARGOS Image Understanding System. PhD
Th., Carnegie-Mellon University, 1978.

16. Shafer, S. A., and Kanade, T. Using Shadows in Finding
Surface Orientations. Tech. Rept. CMU-CS-82-100, Carnegie-Mellon
University, January, 1982.



Figure 5: Matches that have been found for
junctions and lines in the two images.

Figure 7: Rectangular boxes indicate geometric constraints on topological
nodes. Arrows indicate direction of propagation of constraints.

Figure 8: (a) Initial structure graph. (b) Link 4 is deleted.
(c) Resulting structure graph after effects of deletion have been propagated.

Figure 11: Reconstructed buildings with gray scale mapped onto faces. Gray scale
values were derived from one of the images in Fig. 1. In a color display, faces
and portions of faces that are occluded in the original image show up as red.

Figure 12: Three-dimensional perspective view of vertices and edges generated
manually. This information might be derived from stereo analysis of images
obtained from an opposite point of view from that shown in Fig. 1. The
viewpoint for this perspective drawing is almost the same as for Fig. 9, to
allow easier comparison by the reader.

Figure 13: Three-dimensional perspective views
of buildings reconstructed from two views.

Figure 14: 3D Mosaic flowchart, showing major modules (boxes) and
data structures (ellipses). The matcher has not yet been implemented.
ABSTRACT

A fundamental problem in computer vision is how, given
an  image,  to  determine  the  orientation  of  curves  and  surfaces  in 
space.  The  problem  is  difficult  because  metric  properties,  such 
as  orientation  and  length,  are  not  invariant  under  projection. 
Under  perspective  projection  (the  correct  model  for  most  real 
images)  the  transform  is  nonlinear,  and  therefore  hard  to  invert. 
Two  methods  are  presented:  one  finds  the  orientation  of  parallel 
lines  and  planes  by  locating  vanishing  points  and  vanishing  lines; 
the  other  determines  the  orientation  of  planes  by  backprojection 
of  angles. 


1  Introduction 


A computational theory of vision must explain a very puzzling
aspect of human visual experience. How is it that we correctly
perceive three-dimensional properties of objects in space
from two-dimensional projections (e.g., a single image)? At
first, it seems that essential information for depth is lost when
the retinal image is formed: a ray of light may as well have
come from a star light-years distant as from across the room.
Nevertheless, we have definite impressions of the distances and
orientations of the things we see, even when there is no explicit,
unambiguous information about these three-space relations in
the image.

There are a few purely physical mechanisms that can
account for some modes of spatial perception: in particular,
accommodation of the lens to focus at different distances,
binocular stereopsis, and optic flow. But while these
mechanisms may account for some spatial perception, their
explanation remains insufficient and incomplete. We usually have
no trouble interpreting single images with substantial ranges of
depth, or even simple line drawings with an infinite number of
possible interpretations. Since information is lost in projecting
a three-dimensional scene onto a two-dimensional surface, some
form of computational 'cognitive' model is required to construct
percepts from ambiguous, incomplete, and noisy images.

Three important spatial properties that we perceive are
size, shape, and depth. Size and shape are fundamentally
different from depth because they are defined relative to an
object, while depth is defined relative to an observer. Size
is usually measured with ordinary Euclidean metrics: length,
area, and volume. It is difficult to give a precise definition of
shape, but the essential principle is that the shape of an object
is the spatial arrangement of the contours and surfaces of which
it is composed. While size is independent of the choice of a
coordinate system, shape usually is not. Shape is often specified
in some “natural” object-centered coordinate system that is
selected and aligned to match the symmetry of the object.

In what follows, we shall assume that shapes can be adequately
described by straight lines and planes. These primitive
shape descriptors are the simplest geometrical contours and
surfaces we can hope to find. They are common in scenes containing
man-made objects, less common in natural scenes. If we
can develop computational methods for the perception of lines
and planes, we can perhaps generalize them to include more
complex shapes.

To recover 3-D shape from 2-D projections, an explicit
model of the projective transform is essential. Two models are
common: parallel and central projection (Figure 1). In parallel
projection an image is formed by parallel rays, usually
perpendicular to the image plane. In central projection an image
is formed by rays passing through a common point in space
called the focal point. The parallel projective transform is called
“orthographic,” the central projective transform “perspective.”

It  is  important  to  emphasize  that  central  projection  is  the 
correct  model  both  for  human  vision  and  for  cameras,  whereas 
parallel  projection  is  only  an  approximation. 

The most important parameter that distinguishes perspective
from orthographic projection is the included angle of view,
which is defined to be the maximum angle between two rays
(i.e., the angle between the two rays with the greatest angular
separation). The assumption of orthographic projection is
essentially equivalent to the assumption of zero included angle
of view. Locally, perspective projection is approximately
orthographic because the included angle of view is small. When
the entire image is considered, however, perspective is important.

If the focal length (the perpendicular distance from the
focal point to the image plane) is large compared to the linear
dimension of the image, the included angle of view is small
and the orthographic approximation is reasonable. Photographs
taken with “normal” lenses for a certain film format (e.g., a
50-mm lens on a 35-mm camera) typically cover about 45 degrees
of view, and perspective effects are often quite apparent. If a
wide-angle lens is used, perspective is dominant and the picture
may appear distorted, although the ‘distortion’ is merely the
result of viewing the photograph from the wrong distance.
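The relation between focal length, image size, and included angle of view can be checked numerically. The sketch below assumes a simple pinhole model in which the angle subtended by an image extent s at focal length f is 2 arctan(s / 2f); the function name and the 36 mm x 24 mm frame size used for a 35-mm camera are assumptions of this example, not values taken from the text.

```python
import math


def included_angle_of_view(image_extent, focal_length):
    """Angle in degrees subtended by an image extent under a pinhole model.

    image_extent and focal_length must be in the same units.
    """
    return math.degrees(2.0 * math.atan(image_extent / (2.0 * focal_length)))


# Assumed 36 mm x 24 mm frame, measured along its diagonal, with a 50-mm lens.
diagonal = math.hypot(36.0, 24.0)
normal_lens = included_angle_of_view(diagonal, 50.0)

# A very long focal length drives the angle toward zero, which is the
# regime in which the orthographic approximation becomes reasonable.
long_lens = included_angle_of_view(diagonal, 5000.0)
```

Under these assumptions the 50-mm lens gives an angle in the mid-40-degree range, in line with the "about 45 degrees" quoted above.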

Another parameter that causes perspective images to differ
from orthographic ones, even when the included angle of view
is small, is the ratio of the distances to objects in the scene
(or, informally stated, the ratios of “depths” of points). Under
perspective, the projected linear dimensions of an object vary inversely with
the object's distance from the focal point (measured along the
principal ray, defined to be the ray perpendicular to the image
plane), so its projected area varies inversely with the square of that
distance. Under orthography, however, the size of an object in an
image is independent of depth.

The perspective camera model, therefore, will be required
for accurate recovery of size, shape, and depth whenever the
image covers a substantial included angle of view, or whenever
objects at very different depths are compared.

The difference between orthographic and perspective
projection is not only quantitative, but also qualitative. In
Figure 2 one of the most familiar of all illusions, the Necker
cube, is shown in parallel and central projection. In both
cases the images are highly ambiguous because they could have
been produced by an infinite number of objects; nevertheless, in
each case we perceive only two distinct interpretations. The
interpretations of the orthographic image are more or less equally
preferable because both have the same symmetry. The
interpretations of the perspective image, however, are radically
different: one is a symmetrical cube and one is a relatively
asymmetrical octahedron. There are other qualitative differences
between orthography and perspective. For example, vanishing
points and vanishing lines are not found in orthographic
projections, but they are characteristic of perspective projections.
(This topic will be covered in detail later.)

Attempts to use an explicit model of the projective transform
have a long history in computer vision. Mackworth
used the concept of gradient space [1], based on Huffman’s
dual space [2], to interpret line drawings of polyhedral scenes.
Gradient space, combined with the parallel projection, is a useful
tool because physical constraints on the scene can be represented
as relations in gradient space. Horn, for example, used
this approach in his analysis of “shape-from-shading” [3]. An
overview of gradient-space methods can be found in [4].

For reasons that will be made clear in Section 2, perspective
involves more difficult mathematics than does orthography.
(See Haralick for a discussion of the mathematics of perspective
[5].) Most computer vision approaches, therefore, begin
with the assumption of parallel projection. One notable exception
is Kender [6], who argued that gradient space is not an
ideal domain for representing constraints under perspective. A
different domain, the Gaussian sphere, which will be described
in Section 3, is much more useful.


2 Mathematics of Perspective

2.1 Algebraic Methods

In this section the basic mathematical models of central
projection will be reviewed.

Parallel projection can be represented as a simple linear
transform. Given a point p = (x, y, z), the parallel projection
p' of p is given by

\[ p' = (x, \; y, \; 0) . \qquad (2.1) \]

Central projection, on the other hand, is an essentially nonlinear
transform: image coordinates are determined by dividing
scene coordinates by the depth as measured along the principal
ray. The central projection p' of p onto an image plane at z = 0
through a focal point at (0, 0, -f) is given by

\[ p' = \left( \frac{xf}{z+f}, \; \frac{yf}{z+f}, \; 0 \right) . \qquad (2.2) \]

The perspective transform can be expressed elegantly with
homogeneous coordinates. Homogeneous coordinates were first
developed as an analytical tool for projective geometry [7]; more
recently they have been used effectively in computer graphics [8]
and industrial automation [9]. The homogeneous coordinates of
a point are represented by a four-tuple (x, y, z, w), and the
ordinary three-dimensional coordinates of the point are obtained
by (x/w, y/w, z/w). One advantage in using homogeneous coordinates
in projective geometry derives from the fact that points “at
infinity” are represented as four-tuples with w = 0, whereas in
ordinary coordinates they have no representation.

The formulation of central projection in homogeneous coordinates
is as follows: a point P with homogeneous coordinates
(x, y, z, w) is projected onto the image plane at point q with
ordinary coordinates q = (xf/(z + wf), yf/(z + wf), 0). This projection can
be expressed in matrix form as

\[ P'^{T} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1/f & 1 \end{pmatrix} P^{T} , \qquad (2.3) \]

followed by conversion to ordinary coordinates

\[ p' = \left( \frac{xf}{z+wf}, \; \frac{yf}{z+wf}, \; \frac{zf}{z+wf} \right) , \qquad (2.4) \]

and finally a “parallel” projection transform

\[ q^{T} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} p'^{T} . \qquad (2.5) \]

Clearly, parallel projection is a special case of central
projection. If the focal length is infinite Eq. (2.3) becomes the
identity transform.
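As an illustration of the projection formulas in this section, the following minimal sketch implements parallel projection, central projection onto the plane z = 0 through a focal point at (0, 0, -f), and the homogeneous form with denominator z + wf. The function names are assumptions made for this example.

```python
def parallel_projection(p):
    """Orthographic projection onto the plane z = 0."""
    x, y, _z = p
    return (x, y, 0.0)


def central_projection(p, f):
    """Perspective projection onto z = 0 through the focal point (0, 0, -f)."""
    x, y, z = p
    if z + f == 0:
        raise ValueError("point lies in the plane of the focal point")
    s = f / (z + f)
    return (x * s, y * s, 0.0)


def central_projection_homogeneous(P, f):
    """Same projection for a homogeneous point (x, y, z, w); returns (qx, qy)."""
    x, y, z, w = P
    denom = z + w * f
    if denom == 0:
        raise ValueError("point projects to infinity in the image plane")
    return (x * f / denom, y * f / denom)


# An ordinary point, and the point "at infinity" (w = 0) in direction (1, 0, 1).
near = central_projection((2.0, 4.0, 10.0), 10.0)
vanishing = central_projection_homogeneous((1.0, 0.0, 1.0, 0.0), 2.0)
```

The second call shows the advantage noted in the text: a direction, written as a point with w = 0, projects to a finite image point, which is the vanishing point of all lines having that direction.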


2.2 A Computational Approach

Geometric properties can be divided into two classes:
metric, such as the length and orientation of lines; and descriptive,
such as the collinearity of three points or the coincidence of
three lines. Metric properties are in general not invariant under
projection (either parallel or central), but descriptive ones are.
One general approach to using a camera model for image
interpretation (and the one used here) is to first identify instances
of descriptive attributes in the image, which, since they are
invariant under projection, express strong, unambiguous, and
often global information about the scene. These descriptive
attributes can then be combined with geometric constraints (the
camera model), with heuristic rules (such as a preference for
symmetrical figures), and with specific knowledge of the scene
to infer metric properties.

Under orthographic projection the parallelism of lines (a
descriptive property) is invariant, but under perspective projection
it is not, and must therefore be replaced with a more
general property. A central projection maps parallel lines in
space onto a pencil of lines intersecting at a common point on
the image plane. This point of intersection, called a vanishing
point, has important implications for image interpretation. In
perspective, consequently, parallelism as a descriptive property
is replaced by “coincidence.” In Section 3 a computational
method for finding vanishing points is described.

Under both orthographic and central projection the metric
properties of angles are transformed in highly ambiguous ways.
An angle on a plane in three-space (defined as the intersection
of two lines in the plane) can project to any angle in the
image, depending on the orientation of the plane with respect
to the image. From a more optimistic standpoint, we can say
that angles in the image constrain the orientation of planes in
three-space. In Section 4 an algorithm is presented for finding
the orientation of planes from image angles and heuristic
symmetry assumptions. Computational methods for shape perception
sometimes exploit known or suspected symmetry in the
scene to decide among multiple interpretations. Kanade has
used this approach for interpreting orthographic projections
[10].

3 Vanishing Points

This section describes a method for finding vanishing points
in a perspective image and discusses how to use them to interpret
the scene.

The  approach  is  based  on  the  assumption  that  there  exist 
groups  of  parallel  straight  structures  in  the  scene,  and  that 
these  structures  produce  line  segments  in  the  image.  According 
to  the  laws  of  perspective,  such  a  group  of  image  line  segments, 
when  extended,  will  intersect  at  a  common  vanishing  point. 
This  point  has  the  following  interpretation:  it  is  the  projection 
of the intersection of the parallel lines “at infinity.” Once
the  vanishing  point  is  located,  the  orientation  of  the  group  of 
parallel  three-space  lines  is  established  (assuming  that  the  focal 
length  is  known).  This  is  illustrated  in  Figure  3. 
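A minimal sketch of this last step, assuming the coordinate convention used later in this section (focal point at the origin, image points written as (x, y, f)): the three-space direction of the parallel lines is the unit vector from the focal point through the vanishing point. The function name is an assumption of this example.

```python
import math


def direction_from_vanishing_point(xv, yv, f):
    """Unit direction of the 3-space lines whose vanishing point is (xv, yv).

    Assumes the focal point is at the origin and the image plane is z = f.
    """
    n = math.sqrt(xv * xv + yv * yv + f * f)
    return (xv / n, yv / n, f / n)


# A vanishing point at the image center means the lines run along the
# principal ray; one displaced by f along x means a 45-degree direction.
along_axis = direction_from_vanishing_point(0.0, 0.0, 50.0)
oblique = direction_from_vanishing_point(50.0, 0.0, 50.0)
```

The dependence on f in the normalization is why the focal length must be known before orientation can be recovered.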

The problem of finding vanishing points is divided into (1)
finding line segments in the image and (2) finding intersections
of the extended line segments that are likely to be vanishing
points.


Problem (1) is solved with well-known, conventional techniques
that will not be discussed here.

Problem (2), that of finding intersections, is greatly
simplified by using a transform called Gaussian mapping [11].
The problem with trying to find intersections directly in the
image is that the image plane is an open space, and the vanishing
points may occur anywhere, even “at infinity.” (The use of
gradient space to represent surface orientation raises the same
problem: as the surface normal approaches 90 degrees from the
z-axis, the gradient-space point approaches infinity.)

Gaussian mapping transforms vectors in three-space into
points on a unit sphere at the origin (Figure 4). The vectors can
represent either planar normals or the direction cosines of lines.
Since all parallel vectors in space map to the same point on the
sphere, any point on the sphere represents a family of parallel
vectors. There is an interesting dual relationship between lines
and planes in projective space: two lines determine a plane, and
two planes determine a line. There is a similar dual relationship
on the Gaussian sphere between points and lines (i.e., great
circles). A point on the sphere determines a pole through the
origin: the dual of the point is the equator associated with the
pole.

The interpretation plane associated with an image line
is defined as follows (Figure 5). Let p1 = (x1, y1, f) and p2 =
(x2, y2, f) be two distinct image points defining a line l. Then
the interpretation plane φ associated with l is the plane containing
l and the origin (i.e., the focal point), and can be represented
by its unit normal:

\[ \phi = \frac{p_1 \times p_2}{\lVert p_1 \times p_2 \rVert} . \qquad (3.1) \]

The plane φ is called the interpretation plane of l because the
line in space, the projection of which is l, must lie in φ.
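The unit normal of an interpretation plane can be computed with a cross product, since the plane contains the origin and both image points. The sketch below assumes image points are given as (x, y) pairs on the plane z = f; the function names are illustrative assumptions.

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def interpretation_plane_normal(p1, p2, f):
    """Unit normal of the plane through the origin and image points p1, p2.

    p1 and p2 are (x, y) image coordinates; the image plane is z = f.
    The sign of the normal depends on the order of the two points.
    """
    n = cross((p1[0], p1[1], f), (p2[0], p2[1], f))
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        raise ValueError("the two image points must be distinct")
    return tuple(c / length for c in n)


# A horizontal image line through the image center lies in the xz-plane,
# so its interpretation plane has its normal along the y-axis.
normal = interpretation_plane_normal((0.0, 0.0), (1.0, 0.0), 1.0)
```

Either sign of the normal describes the same plane, so downstream steps should not depend on its orientation.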

The interpretation planes of image lines intersect the
Gaussian sphere in great circles, as shown in Figure 6. The
intersections of these great circles on the sphere correspond
exactly to intersections of their associated lines in the image plane.
The procedure for finding vanishing points is then as follows: (1)
find lines in the image; (2) determine the interpretation plane
of each line; (3) trace the great-circle intersections of the
interpretation planes with the Gaussian sphere; (4) find the points on
the sphere where several great circles intersect. The vanishing
points can be projected back onto the image plane, if desired.

After a vanishing point has been found, its dual interpretation
can also be very useful for interpreting the image. The
dual of a vanishing point is a vanishing line (a great circle
on the sphere). For example, if the vertical vanishing point is
found, its dual is the “horizon line” (i.e., the vanishing line of
all horizontal planes). As another example, if two horizontal
vanishing points are found, their duals intersect at the vertical
vanishing point.
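The second example can be sketched with a cross product. If vanishing points are represented as unit vectors u and v on the Gaussian sphere, their duals are the great circles perpendicular to u and to v, and those two circles meet at the pair of antipodal points along u x v. The function names are assumptions of this example.

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / length for c in v)


def dual_intersection(u, v):
    """Point (up to sign) where the duals of sphere points u and v intersect."""
    return normalize(cross(u, v))


# Two horizontal directions in a scene whose vertical is the y-axis.
h1 = normalize((1.0, 0.0, 1.0))
h2 = normalize((-1.0, 0.0, 1.0))
vertical = dual_intersection(h1, h2)
```

The result is defined only up to sign, since a direction and its opposite correspond to the same vanishing point.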

The Gaussian sphere can be digitally represented as a
two-dimensional array of real numbers, with the row index indicating
an azimuth and a column index representing an elevation.
Each array element corresponds to a small surface area of the
sphere. (In this representation the surface areas are not equal.
Other forms, such as tessellated regular polyhedra, might be
better suited to representing the sphere, but they are rather
complicated to implement.)

The procedure for tracing a great circle in the sphere array
is as follows. Let a and d be the spherical coordinates of a
point, where a is the azimuth measured from the z-axis and
d is the elevation measured from the xz-plane. In Cartesian
coordinates, the point is

\[ g = (\sin a \cos d, \; \sin d, \; \cos a \cos d) . \qquad (3.2) \]

The equation of the great circle associated with interpretation
plane φ = (φ_x, φ_y, φ_z) is

\[ g \cdot \phi = 0 , \qquad (3.3) \]

from which we can derive

\[ d(a, \phi) = \tan^{-1} \left( \frac{-\phi_x \sin a - \phi_z \cos a}{\phi_y} \right) . \qquad (3.4) \]

If $\phi_y$ is small, this equation can be replaced by a slightly
different form that gives azimuth as a function of elevation and
has $\phi_z$ in the denominator.

The array is first initialized to zeros. Since we have the
elevation as a function of the azimuth $\alpha$ and an interpretation
plane $\phi$ (3.1), we can generate the great circle of $\phi$ in the array.
When a curve is traced into the array, a real value associated
with the curve is added to all the array elements containing the
curve. (This value is derived heuristically from the length and
goodness-of-fit of the image line.) Points where many curves
intersect form clusters of high values; these indicate likely
vanishing points.
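The accumulation step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the array is a plain list of lists indexed as [azimuth][elevation], the azimuth is sampled more finely than the cell size so that every cell crossed by the curve is marked once, and the weight is supplied by the caller rather than derived from line length and goodness-of-fit. The case of small $\phi_y$, for which the text prescribes a different form of the equation, is not handled and is simply rejected.

```python
import math

def make_accumulator(n_az, n_el):
    """Sphere array initialized to zeros: rows = azimuth, columns = elevation."""
    return [[0.0] * n_el for _ in range(n_az)]

def cell_of(alpha, beta, n_az, n_el):
    """Array cell containing azimuth alpha in [-pi, pi) and elevation beta in [-pi/2, pi/2]."""
    i = min(int((alpha + math.pi) / (2 * math.pi) * n_az), n_az - 1)
    j = min(int((beta + math.pi / 2) / math.pi * n_el), n_el - 1)
    return i, j

def cell_direction(i, j, n_az, n_el):
    """Cartesian direction of a cell centre, following Eq. (3.2)."""
    alpha = -math.pi + (i + 0.5) * 2 * math.pi / n_az
    beta = -math.pi / 2 + (j + 0.5) * math.pi / n_el
    return (math.sin(alpha) * math.cos(beta),
            math.sin(beta),
            math.cos(alpha) * math.cos(beta))

def trace_great_circle(acc, phi, weight, samples_per_cell=20):
    """Add `weight` once to every cell crossed by the great circle g . phi = 0."""
    n_az, n_el = len(acc), len(acc[0])
    px, py, pz = phi
    if abs(py) < 1e-9:
        # Eq. (3.4) divides by phi_y; the alternative form is not sketched here.
        raise ValueError("phi_y too small for the elevation-from-azimuth form")
    n = n_az * samples_per_cell
    cells = set()
    for k in range(n):
        alpha = -math.pi + (k + 0.5) * 2 * math.pi / n
        beta = math.atan(-(px * math.sin(alpha) + pz * math.cos(alpha)) / py)  # Eq. (3.4)
        cells.add(cell_of(alpha, beta, n_az, n_el))
    for i, j in cells:
        acc[i][j] += weight

def peak_cell(acc):
    """Index of the cell with the largest accumulated value."""
    return max(((i, j) for i in range(len(acc)) for j in range(len(acc[0]))),
               key=lambda c: acc[c[0]][c[1]])
```

Because antipodal directions are equivalent, each vanishing point produces two peaks on the full sphere; either one can be reported.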

Figures 7 and 8 show two examples of the method's application
to real images. They were recorded onto 50 mm x 50 mm
film areas with a 50-mm-focal-length lens (considered "wide-angle"
for this film format) and were digitized at a resolution
of 100 microns/pixel.

In Figure 7c, only the vertical vanishing point is clearly
found on the Gaussian sphere. In Figure 7d, the dual of the vertical
vanishing point is shown to be the horizon line. In Figure
7e, those image line segments whose interpretation planes contain
the vertical vanishing point are shown. In Figure 8, two
horizontal vanishing points are found; the vertical vanishing
point is located by intersecting their duals.


4  Orientation  of  Planes 

In this section, we shall consider a somewhat different algorithm
that also uses Gaussian mapping to deal with the perspective
camera model. The object of the algorithm is to find the
orientation of a plane in three-space from a perspective image of
a line figure drawn on the plane. Heuristic assumptions about
the figure will be used to constrain the orientation of the plane.

Assume that the line figure is a closed planar polygon
with nonintersecting sides, such as a triangle. Each pair of
adjacent sides defines an angle, and each angle in the planar
figure generally projects to a different angle in the image (Figure
9). (The metric property of angle size is not preserved under
projection.) Nevertheless, an angle measured in the image
constrains the angle measured in the planar figure, but the
constraint ranges over a family of possible planar orientations. In
essence, the angle in the image can "backproject" onto any
plane in space.

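Consider two interpretation planes, $\phi_1 = (\phi_{1x}, \phi_{1y}, \phi_{1z})$
and $\phi_2 = (\phi_{2x}, \phi_{2y}, \phi_{2z})$, associated with two image lines that
make angle $\omega$ in the image plane $\psi = (0, 0, -1)$:

$$(\psi \times \phi_1) \cdot (\psi \times \phi_2) = |\psi \times \phi_1|\,|\psi \times \phi_2| \cos\omega, \tag{4.1}$$

or, substituting for $\psi$,

$$\cos\omega = \frac{\phi_{1x}\phi_{2x} + \phi_{1y}\phi_{2y}}{\sqrt{\phi_{1x}^2 + \phi_{1y}^2}\,\sqrt{\phi_{2x}^2 + \phi_{2y}^2}}. \tag{4.2}$$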


The angle $\phi_1$ and $\phi_2$ make on a plane that has an
arbitrary orientation of $\alpha$ in azimuth and $\beta$ in elevation can be
found by rotating the image plane $\psi$ in azimuth and elevation,
and then evaluating Eq. (4.1). We actually use a slightly
different approach, rotating $\phi_1$ and $\phi_2$ by $-\alpha$ in azimuth and
$-\beta$ in elevation, and then evaluating Eq. (4.2) directly. This
rotation of an interpretation plane $\phi$ into plane $\phi'$ is given by

$$\phi' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\beta & -\sin\beta \\ 0 & \sin\beta & \cos\beta \end{pmatrix} \begin{pmatrix} \cos\alpha & 0 & \sin\alpha \\ 0 & 1 & 0 \\ -\sin\alpha & 0 & \cos\alpha \end{pmatrix} \phi. \tag{4.3}$$


Using (4.2) and (4.3), projected angles measured in the image
can be expressed as constraints on the orientation of the plane
containing the angle in space.

One approach to using the backprojection constraint would
be to solve a system of equations in the form of (4.2), but
such an explicit solution appears to be very difficult. Instead,
using a highly parallel algorithm, we backproject each image
angle onto planes of all possible orientations (subject to some
discrete quantization of the Gaussian sphere), obtaining at
each orientation a value that expresses the angle the figure must
make on such a plane.
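The backprojection of one image angle can be sketched as below. This is a minimal illustration under assumptions: the matrix of Eq. (4.3) is applied exactly as printed, Eq. (4.2) is then evaluated on the rotated plane normals, and the orientation grid is a simple azimuth-elevation sampling. All function names are hypothetical.

```python
import math

def rotate(phi, alpha, beta):
    """Apply the two rotations of Eq. (4.3): azimuth first, then elevation."""
    x, y, z = phi
    x1 = math.cos(alpha) * x + math.sin(alpha) * z
    y1 = y
    z1 = -math.sin(alpha) * x + math.cos(alpha) * z
    return (x1,
            math.cos(beta) * y1 - math.sin(beta) * z1,
            math.sin(beta) * y1 + math.cos(beta) * z1)

def cos_image_angle(phi1, phi2):
    """Eq. (4.2): cosine of the angle between the traces of two interpretation
    planes in the plane z = const. Returns NaN if a trace is undefined."""
    den = math.hypot(phi1[0], phi1[1]) * math.hypot(phi2[0], phi2[1])
    if den == 0.0:
        return math.nan
    return (phi1[0] * phi2[0] + phi1[1] * phi2[1]) / den

def backprojected_cos(phi1, phi2, alpha, beta):
    """cos(omega) for the plane orientation parameterized by (alpha, beta)."""
    return cos_image_angle(rotate(phi1, alpha, beta), rotate(phi2, alpha, beta))

def orientation_map(phi1, phi2, n_az=36, n_el=18):
    """cos(omega) sampled at cell centres over all orientations."""
    rows = []
    for i in range(n_az):
        alpha = -math.pi + (i + 0.5) * 2 * math.pi / n_az
        rows.append([backprojected_cos(phi1, phi2, alpha,
                                       -math.pi / 2 + (j + 0.5) * math.pi / n_el)
                     for j in range(n_el)])
    return rows
```

Contours of a hypothesized angle, such as 60 degrees for an equilateral triangle, would then be read off as the cells where the map is close to the cosine of that angle; intersecting the contours of all angles of the figure selects the candidate orientations.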

Two examples of this method are illustrated in Figures 10
and 11.

The triangle in Figure 10a is interpreted as an image of
some other triangle in space. Each of the three angles is
backprojected onto planes of all possible orientations (Figure 10b).
Each image in Figure 10b represents a map of the back of the
Gaussian sphere. Each point represents a possible planar
orientation in terms of the Gaussian spherical coordinates of the
plane's normal. The intensity value at each point is directly
related to $\cos\omega$ for that orientation (e.g., black indicates $\cos\omega = -1$
and white indicates $\cos\omega = 1$).

Knowledge or heuristic assumptions about the values of
angles in space can be used to choose particular interpretations
of planar orientation. For example, suppose we interpret the
triangle as being as symmetrical as possible, namely, an
equilateral triangle ($\omega$ = 60 degrees). Contours for this value of
$\omega$ are shown in Figure 10b. Note that the triangle
yields two solutions. This result is consistent with human
perception of the figures.

Similarly, the quadrangle in Figure 11a is interpreted as an
image of some other quadrangle in space. In this case four angles
are backprojected (Figure 11b). If we assume the quadrangle in
space is a square ($\omega$ = 90 degrees), and plot the contours for
this value (shown in Figure 11b), we find that it yields a unique
solution.


5  Conclusions 

The  perspective  camera  model  is  crucial  for  the  interpreta¬ 
tion  of  real  images.  Although  parallel  projection  provides  an 
adequate  approximation  when  the  included  angle  of  view  and 
the  range  of  depth  in  the  scene  are  small,  these  conditions  are 
never  completely  satisfied.  Perspective  camera  modeling  entails 
more  difficult  mathematics  than  does  orthography,  but  it  also 
provides more powerful aids to perception (e.g., the Necker cube
example  in  Section  1). 

Gaussian  mapping  is  a  useful  tool  for  the  analysis  of 
perspective  images.  In  Section  3,  Gaussian  mapping  was  used 
to  identify  descriptive  geometric  properties  (the  coincidence  of 
parallel  lines),  and  to  infer  metric  properties  (the  orientation  of 
groups  of  parallel  lines).  The  dual  interpretation  of  vanishing 
points  on  the  Gaussian  sphere  was  used  to  extend  the  analysis 
to  finding  vanishing  lines. 

In Section 4, we described the technique of backprojecting
angles in the image to constrain the interpretation of planes
in  space.  Once  again  the  Gaussian  sphere  was  used  to  repre¬ 
sent  the  space  of  possible  interpretations.  Assumptions  about 
the  symmetry  of  figures  in  space,  combined  with  the  constraint 
surfaces obtained through backprojection, resulted in quantitative
measurement of the orientation of the figures.


Figure 7e) Line segments whose interpretation planes contain
the vanishing point.

Figure 7c) Lines mapped onto the Gaussian sphere (with
a vertical vanishing point).

Figure 7d) The dual of the vertical vanishing point (the
horizon line).

Figure 7a) Original image.

Figure 9 The angle two interpretation planes make on
another plane in space.

Figure 10b) Backprojection of angles onto planes of all
orientations, showing contours for angles of 60 degrees
(two planar orientations are possible).

Figure 10a) A triangle.

Figure 11a) A quadrangle.


A method is proposed for determining the motion of
a body relative to a fixed environment using the changing
image seen by a camera attached to the body. The optical
flow in the image plane is the input, while the instantaneous
rotation and translation of the body are the output. If optical
flow could be determined precisely, it would only have
to be known at a few places to compute the parameters of
the motion. In practice, however, the measured optical flow
will be somewhat inaccurate. It is therefore advantageous
to consider methods which use as much of the available
information as possible. We employ a least-squares approach
which minimizes some measure of the discrepancy between
the measured flow and that predicted from the computed
motion parameters. Several different error norms are
investigated. In general, our algorithm leads to a system of
nonlinear equations from which the motion parameters may be
computed numerically. However, in the special cases where
the motion of the camera is purely translational or purely
rotational, use of the appropriate norm leads to a system of
equations from which these parameters can be determined
in closed form.

1. Introduction

In this paper we investigate the problem of passive
navigation using optical flow information. Suppose we are
viewing a film. We wish to determine the motion of the
camera from the sequence of images, assuming that the
instantaneous velocity of the brightness patterns, also called
the optical flow, is known at each point in the image.
Several schemata for computing optical flow have been
suggested (e.g. [2], [3], [5]). Other papers (e.g. [9], [11], [12])
have previously addressed the problem of passive navigation.
Three approaches can be taken towards a solution,
which we term the discrete, the differential and the
continuous approach.

In the discrete approach, information about the movement
of brightness patterns at only a few points is used to
determine the motion of the camera. In particular, using
such an approach, one attempts to identify and match
discrete points in a sequence of images. Of interest in this
case is the photogrammetric problem of determining
the minimum number of points from which the motion
can be calculated for a given number of images [10], [11],
[12], [16], [17]. This approach requires that one track
features, or identify corresponding features in images taken
at different times. In their work, Tsai and Huang [16]
assumed that such corresponding points can be determined in
two images. They then showed that in general seven points
are sufficient to determine the motion uniquely. They prove
furthermore that such points have to satisfy a fairly weak
constraint. Longuet-Higgins' work [10] is fairly similar to
[16], but he fails to show under which conditions the motion
can be determined uniquely from corresponding points.

In the differential approach, the first and second spatial
partial derivatives of the optical flow are used to compute
the motion of a camera [6], [9]. It has been claimed that it
is sufficient to know the optical flow and both its first and
second derivatives at a single point to uniquely determine
the motion [9]. This is incorrect (except for a special case)
[1]. Furthermore, noise in the measured optical flow is
accentuated by differentiation.

In the continuous approach, the whole optical flow
field is used. A major shortcoming of both the local and
differential approaches is that neither allows for errors in
the optical flow data. This is why we choose the continuous
approach and devise a least-squares technique to determine
the motion of the camera from the measured optical flow.
The proposed algorithm takes the abundance of available
data into account and is robust enough to allow numerical
implementation.

Independently,  Prazdny  chose  in  [13]  a  similar  ap¬ 
proach  to  ours.  He  also  proposes  the  use  of  a  least-squares 
method  to  determine  the  motion  parameters  but  never  dis¬ 
cusses  how  exactly  this  is  to  be  done.  Consequently,  he 
does  not  show  whether  his  scheme  can  be  used  to  uniquely 
determine  the  motion. 


2.  Technical  Prerequisites 

In this section we review the equations describing the
relation between the motion of a camera and the optical
flow generated. We use essentially the same notation as [9].
A camera is assumed to move through a static environment.
Let a coordinate system X, Y, Z be fixed with respect to
the camera, with the Z-axis pointing along the optical axis.
If we wish, we can think of the environment as moving in
relation to this coordinate system. Any rigid body motion
can be resolved into two factors, a translation and a rotation.
We will denote by $\mathbf{T}$ the translational component of
the motion of the camera and by $\boldsymbol{\omega}$ its angular velocity (see
also Figure 1, which is redrawn from [9]). Let the instantaneous
coordinates of a point P in the environment be
$(X, Y, Z)$.


Figure  1.  Coordinate  Systems 


(Note that Z > 0 for points in front of the imaging system.)
Let $\mathbf{r}$ be the vector $(X, Y, Z)^T$, where T denotes the
transpose of a vector; then the velocity of P with respect
to the X, Y, Z coordinate system is:

$$\mathbf{V} = -\mathbf{T} - \boldsymbol{\omega} \times \mathbf{r}. \tag{1}$$

We define the components of $\mathbf{T}$ and $\boldsymbol{\omega}$ as:

$$\mathbf{T} = (U, V, W)^T, \qquad \boldsymbol{\omega} = (A, B, C)^T. \tag{2}$$

Thus we can rewrite (1) in component form:

$$\begin{aligned} X' &= -U - BZ + CY \\ Y' &= -V - CX + AZ \\ Z' &= -W - AY + BX, \end{aligned} \tag{3}$$

where ' denotes differentiation with respect to time.

The optical flow at each point in the image plane is
the instantaneous velocity of the brightness pattern at that
point. Let (x, y) denote the coordinates of a point in the
image plane (see Figure 1). Since we assume perspective
projection between an object point P and the corresponding
image point p, the coordinates of p are:

$$x = \frac{X}{Z}, \qquad y = \frac{Y}{Z}. \tag{4}$$

The optical flow, denoted by (u, v), at a point (x, y) is:

$$u = x', \qquad v = y'. \tag{5}$$

Differentiating (4) with respect to time and using (3) we
obtain the following equations for the optical flow:

$$\begin{aligned} u &= \frac{X'}{Z} - \frac{XZ'}{Z^2} = \left(-\frac{U}{Z} - B + Cy\right) - x\left(-\frac{W}{Z} - Ay + Bx\right) \\ v &= \frac{Y'}{Z} - \frac{YZ'}{Z^2} = \left(-\frac{V}{Z} - Cx + A\right) - y\left(-\frac{W}{Z} - Ay + Bx\right). \end{aligned} \tag{6}$$

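We can write these equations in the form:

$$u = u_t + u_r, \qquad v = v_t + v_r, \tag{7}$$

where $(u_t, v_t)$ denotes the translational component of the
optical flow and $(u_r, v_r)$ the rotational component:

$$\begin{aligned} u_t &= \frac{-U + xW}{Z}, & u_r &= Axy - B(x^2 + 1) + Cy, \\ v_t &= \frac{-V + yW}{Z}, & v_r &= A(y^2 + 1) - Bxy - Cx. \end{aligned} \tag{8}$$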

So far we have considered a single point P. To define
the optical flow globally we assume that P lies on a surface
defined by a function Z = Z(X, Y) which is positive for all
values of X and Y. With any surface and any motion of
a camera we can therefore associate a certain optical flow
and we say that the surface and the motion generate this
optical flow.

Optical  flow,  therefore,  depends  upon  the  six  parameters 
of  motion  of  the  camera  and  upon  the  surface  whose  images 
are  analyzed.  Can  all  these  unknowns  be  uniquely  recap¬ 
tured  solely  from  optical  flow?  The  answer  is  no.  To  see 
this, consider a surface $S_2$ which is a dilation by a factor k
of a surface $S_1$. Further, let two motions denoted by $M_1$
and $M_2$ have the same rotational component and let their
translational components be proportional to each other by
the same factor k (we will say that $M_1$ and $M_2$ are similar).
Then the optical flow generated by $S_1$ and $M_1$ is the same
as the optical flow generated by $S_2$ and $M_2$. This follows
directly from the definition of optical flow (8). It is still an
open question whether there are any other pairs of distinct
surfaces and motions which generate the same optical flow.
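The scale ambiguity can be checked numerically with a direct transcription of equations (7) and (8). The function below is a sketch; its name and argument layout are illustrative and not taken from the paper.

```python
def optical_flow(x, y, Z, T, omega):
    """Optical flow (u, v) at image point (x, y) for depth Z,
    translation T = (U, V, W) and rotation omega = (A, B, C), per Eqs. (7)-(8)."""
    U, V, W = T
    A, B, C = omega
    u_t = (-U + x * W) / Z                       # translational part, depends on Z
    v_t = (-V + y * W) / Z
    u_r = A * x * y - B * (x * x + 1.0) + C * y  # rotational part, independent of Z
    v_r = A * (y * y + 1.0) - B * x * y - C * x
    return u_t + u_r, v_t + v_r
```

Multiplying both the depth Z and the translation T by the same factor k leaves the returned flow unchanged, because only the ratios U/Z, V/Z and W/Z enter the translational terms.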

Determining the motion of a camera from optical flow
is much easier if we are told that the motion is purely
translational or purely rotational. In the next two sections
we will deal with these two special cases. Then we shall
analyze the case where no a priori assumptions about the
motion of the camera are made.

3.  Translational  Case 

In this section we discuss the case where the motion
of the camera is assumed to be purely translational. As
before, let $\mathbf{T} = (U, V, W)^T$ be the velocity of the camera.
Then the following equations hold (see (8)):

$$u_t = \frac{-U + xW}{Z}, \qquad v_t = \frac{-V + yW}{Z}. \tag{9}$$


3.1.  Similar  Surfaces  and  Similar  Motions 

It will be shown next that if two surface-and-motion pairs
generate the same optical flow, and we know that the motions are purely
translational, then the two surfaces are similar and the two
camera motions are similar. Let $Z_1$ and $Z_2$ be two surfaces
and let $\mathbf{T}_1 = (U_1, V_1, W_1)^T$ and $\mathbf{T}_2 = (U_2, V_2, W_2)^T$ define
two different motions of a camera, such that $Z_1$ and $\mathbf{T}_1$ and
$Z_2$ and $\mathbf{T}_2$ generate the same optical flow:

$$u = \frac{-U_1 + xW_1}{Z_1}, \qquad v = \frac{-V_1 + yW_1}{Z_1} \tag{10}$$

$$u = \frac{-U_2 + xW_2}{Z_2}, \qquad v = \frac{-V_2 + yW_2}{Z_2}. \tag{11}$$

Eliminating $Z_1$, $Z_2$, u and v from these equations we obtain:

$$\frac{-U_1 + xW_1}{-V_1 + yW_1} = \frac{-U_2 + xW_2}{-V_2 + yW_2}. \tag{12}$$

We can rewrite this equation as:

$$(-U_1 + xW_1)(-V_2 + yW_2) = (-U_2 + xW_2)(-V_1 + yW_1), \tag{13}$$

or:

$$U_1V_2 - xV_2W_1 - yU_1W_2 + xyW_1W_2 = U_2V_1 - xV_1W_2 - yU_2W_1 + xyW_2W_1. \tag{14}$$

Since we assumed that $Z_1$ and $\mathbf{T}_1$ and $Z_2$ and $\mathbf{T}_2$ generate
the same optical flow, the above equation must hold for all
x and y. Therefore the following equations have to hold:

$$\begin{aligned} U_1V_2 &= U_2V_1 \\ -V_2W_1 &= -V_1W_2 \\ -U_1W_2 &= -U_2W_1. \end{aligned} \tag{15}$$

These equations can be rewritten as:

$$U_1 : V_1 : W_1 = U_2 : V_2 : W_2 \tag{16}$$

from which it follows that $Z_2$ is a dilation of $Z_1$. It is clear
that the scaling factor between $Z_1$ and $Z_2$ (or equivalently
between $\mathbf{T}_1$ and $\mathbf{T}_2$) cannot be recovered from the optical
flow, regardless of the number of points at which the flow is
known. By uniquely determining the motion of the camera,
we will mean that the motion is uniquely determined up to
a constant scaling factor.

3.2.  Least-Squares  Formulation 

In general, the directions of the optical flow at two
points in the image plane determine the motion of a camera
in pure translation uniquely. There is a drawback, however,
to  utilizing  so  little  of  the  available  information.  The  opti¬ 
cal  flow  we  measure  is  corrupted  by  noise  and  it  is  desirable 
to  develop  a  robust  method  which  takes  this  into  account. 
Thus  we  suggest  using  a  least-squares  method  [4],  [14]  to 
determine  the  movement  parameters  and  the  surface  (i.e., 
the  best  fit  with  respect  to  some  norm). 

For the following we assume that the image plane is
the rectangle $x \in [-w, w]$ and $y \in [-h, h]$. The same method
applies if the image has some other shape. (In fact, it can
be used on sub-images corresponding to individual objects
in the case that the environment contains objects which
may move relative to one another.) Furthermore we have to
assume that 1/Z is a bounded function and that the set of
points where 1/Z is discontinuous is of measure zero. This
condition on 1/Z assures us that all necessary integrations
can be carried out. We wish to minimize the following
expression:

$$\int_{-h}^{h}\int_{-w}^{w} \left[\left(u - \frac{-U + xW}{Z}\right)^2 + \left(v - \frac{-V + yW}{Z}\right)^2\right] dx\,dy. \tag{17}$$

In this case, then, we determine the best fit with respect to
the $ML_2$ norm, which is defined as:

$$\|f(x, y)\| = \int_{-h}^{h}\int_{-w}^{w} [f(x, y)]^2\,dx\,dy. \tag{18}$$

The steps in the least-squares method are as follows: First
we determine that Z which minimizes the integrand of (17)
at every point (x, y). Then we determine the values of U, V
and W which minimize the integral (17).

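Let us introduce the following abbreviations:

$$\alpha = -U + xW, \qquad \beta = -V + yW. \tag{19}$$

Note that the expected flow, given U, V and W, is simply:

$$\bar{u} = \frac{\alpha}{Z}, \qquad \bar{v} = \frac{\beta}{Z}. \tag{20}$$

Then we can rewrite (17) as:

$$\int_{-h}^{h}\int_{-w}^{w} \left[\left(u - \frac{\alpha}{Z}\right)^2 + \left(v - \frac{\beta}{Z}\right)^2\right] dx\,dy. \tag{21}$$

We proceed now with the first step of our minimization
method. Differentiating the integrand of (17) with respect
to Z and setting the resulting expression equal to zero
yields:

$$\left(u - \frac{\alpha}{Z}\right)\alpha + \left(v - \frac{\beta}{Z}\right)\beta = 0. \tag{22}$$

Therefore we can write Z as:

$$Z = \frac{\alpha^2 + \beta^2}{u\alpha + v\beta}. \tag{23}$$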

This equation, by the way, imposes a constraint on U, V
and W, since Z must be positive. We do not make use of
this except to help us pick amongst two opposite solutions
for the translational velocity later on. Note that now:

$$u - \frac{\alpha}{Z} = \beta\,\frac{u\beta - v\alpha}{\alpha^2 + \beta^2}, \qquad v - \frac{\beta}{Z} = -\alpha\,\frac{u\beta - v\alpha}{\alpha^2 + \beta^2}, \tag{24}$$

and we can therefore rewrite (17) as:

$$\int_{-h}^{h}\int_{-w}^{w} \frac{(u\beta - v\alpha)^2}{\alpha^2 + \beta^2}\,dx\,dy. \tag{25}$$

It should be clear, by the way, that uniformly scaling U, V
and W does not change the value of (25). This is a reflection
of the fact that we can determine the motion parameters
only up to a constant factor.
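The first minimization step can be illustrated pointwise. The sketch below evaluates (23) and the integrand of (25) for a single measured flow vector; the function name and the handling of the degenerate cases are choices made for this illustration.

```python
import math

def best_depth_and_residual(x, y, u, v, T):
    """For measured flow (u, v) at (x, y) and a candidate translation T,
    return the depth Z of Eq. (23) and the remaining squared error,
    which is the integrand of Eq. (25)."""
    U, V, W = T
    alpha = -U + x * W              # Eq. (19)
    beta = -V + y * W
    s = alpha * alpha + beta * beta
    if s == 0.0:
        # At the focus of expansion the predicted flow vanishes for every Z.
        raise ValueError("alpha = beta = 0: depth is not constrained here")
    proj = u * alpha + v * beta
    Z = s / proj if proj != 0.0 else math.inf   # Eq. (23)
    residual = (u * beta - v * alpha) ** 2 / s  # integrand of Eq. (25)
    return Z, residual
```

For example, with T = (0.2, -0.1, 1.0), the point (0.5, -0.25) and the measured flow (0.3, 0.1), this gives alpha = 0.3 and beta = -0.15, so Z = 0.1125 / 0.075 = 1.5 and the residual is 0.05, which equals (0.3 - 0.2)^2 + (0.1 + 0.1)^2 as expected from (21).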


Figure 2. Geometrical Interpretation


Before proceeding with the second step, we give a
geometrical interpretation in Figure 2 of what we have so
far. Suppose that the motion parameters U, V, and W are
given. At any given point, say $(x_0, y_0)$, optical flow depends
not only upon the motion parameters but also upon the
value of Z at that point, $Z_0$ say. However, the direction
of (u, v) does not depend upon $Z_0$. The point (u, v) must
lie along the line L in the uv-plane defined by the equation
$u\beta - v\alpha = 0$. Let the measured optical flow at $(x_0, y_0)$ be
denoted by $(u_m, v_m)$, and let the closest point on the line
L be $(u_a, v_a)$. This corresponds to a particular value of Z given
by (23). The remaining error is the distance between the
point $(u_m, v_m)$ and the line L. The square of this distance
is given by the integrand of (25).

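For the second step, we differentiate (25) with respect
to U, V and W and set the resulting expressions equal to
zero:

$$\begin{aligned} \int_{-h}^{h}\int_{-w}^{w} \frac{\beta\,(u\beta - v\alpha)(u\alpha + v\beta)}{(\alpha^2 + \beta^2)^2}\,dx\,dy &= 0 \\ -\int_{-h}^{h}\int_{-w}^{w} \frac{\alpha\,(u\beta - v\alpha)(u\alpha + v\beta)}{(\alpha^2 + \beta^2)^2}\,dx\,dy &= 0 \\ \int_{-h}^{h}\int_{-w}^{w} \frac{(y\alpha - x\beta)(u\beta - v\alpha)(u\alpha + v\beta)}{(\alpha^2 + \beta^2)^2}\,dx\,dy &= 0. \end{aligned} \tag{26}$$

Let us introduce the following abbreviation:

$$K = \frac{(u\beta - v\alpha)(u\alpha + v\beta)}{(\alpha^2 + \beta^2)^2}. \tag{27}$$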


Then equations (26) can be rewritten as:

$$\begin{aligned} \int_{-h}^{h}\int_{-w}^{w} [(-V + yW)K]\,dx\,dy &= 0 \\ -\int_{-h}^{h}\int_{-w}^{w} [(-U + xW)K]\,dx\,dy &= 0 \\ \int_{-h}^{h}\int_{-w}^{w} [(-yU + xV)K]\,dx\,dy &= 0. \end{aligned} \tag{28}$$

The sum of U times the first integral, V times the second
integral, and W times the third integral is identically zero.
Thus the three equations are linearly dependent. This is to
be expected, for if:

$$f(kU, kV, kW) = f(U, V, W), \tag{29}$$

where f is a differentiable function and k a constant, then:

$$U\frac{\partial f}{\partial U} + V\frac{\partial f}{\partial V} + W\frac{\partial f}{\partial W} = 0. \tag{30}$$


The result is also consistent with the fact that only two
equations are needed, since the translational velocity can
be determined only up to a constant factor. Unfortunately
equations (28) are nonlinear in U, V and W and we are not
able to show that they have a unique (up to a constant
scaling factor) solution.


3.3. Using a Different Norm

There is a way, however, to devise a least-squares
method which allows us to display a closed form solution
for the motion parameters. Instead of minimizing (17), we
will try to minimize the following expression:

$$\int_{-h}^{h}\int_{-w}^{w} \left[\left(u - \frac{-U + xW}{Z}\right)^2 + \left(v - \frac{-V + yW}{Z}\right)^2\right] (\alpha^2 + \beta^2)\,dx\,dy \tag{31}$$

obtained by multiplying the integrand of (17) by $\alpha^2 + \beta^2$.
Then we apply the same least-squares method as before
to (31). When the measured optical flow is not corrupted
by noise, both (31) and (17) can be made equal to zero by
substituting the correct motion parameters. We thus obtain
the same solution for the motion parameters whether we
minimize (31) or (17). If the measured optical flow is not
exact, then using expression (31) for our minimization, we
obtain the best fit with respect not to the $ML_2$ norm, but
to another norm which we call the $ML_{\alpha\beta}$ norm:

$$\|f(x, y)\|_{\alpha\beta} = \int_{-h}^{h}\int_{-w}^{w} [f(x, y)]^2 (\alpha^2 + \beta^2)\,dx\,dy. \tag{32}$$

What we have here is a minimization in which the error
contributions are weighted, greater importance being given
to points where the optical flow velocity is larger. This is
most appropriate when the measurement of larger velocities
is more accurate.

Which norm gives the best results depends on the
properties of the noise in the measured optical flow. The
first norm is better suited to the situation where the noise in
the measurements is independent of the magnitude of the
optical flow. Note also that if we really want the minimum
with respect to the $ML_2$ norm, we can use the results of the
minimization with respect to the $ML_{\alpha\beta}$ norm as starting
values in a numerical minimization.

We discuss now our least-squares method in the case
where the norm is chosen to be $ML_{\alpha\beta}$. First we determine
Z by differentiating the integrand of (31) with respect to Z
and setting the result equal to zero. We again get (22):

$$\left(u - \frac{\alpha}{Z}\right)\alpha + \left(v - \frac{\beta}{Z}\right)\beta = 0, \tag{33}$$

from which it follows that (23):

$$Z = \frac{\alpha^2 + \beta^2}{u\alpha + v\beta}. \tag{34}$$

So we want to minimize:

$$\int_{-h}^{h}\int_{-w}^{w} (u\beta - v\alpha)^2\,dx\,dy. \tag{35}$$

Let us call this integral g(U, V, W); then, since:

$$u\beta - v\alpha = (vU - uV) - (xv - yu)W, \tag{36}$$

we have:

$$g(U, V, W) = aU^2 + bV^2 + cW^2 + 2dUV + 2eVW + 2fWU, \tag{37}$$

where:

$$\begin{aligned} a &= \int_{-h}^{h}\int_{-w}^{w} v^2\,dx\,dy \\ b &= \int_{-h}^{h}\int_{-w}^{w} u^2\,dx\,dy \\ c &= \int_{-h}^{h}\int_{-w}^{w} (xv - yu)^2\,dx\,dy \\ d &= -\int_{-h}^{h}\int_{-w}^{w} uv\,dx\,dy \\ e &= \int_{-h}^{h}\int_{-w}^{w} u(xv - yu)\,dx\,dy \\ f &= -\int_{-h}^{h}\int_{-w}^{w} v(xv - yu)\,dx\,dy. \end{aligned} \tag{38}$$


Now g(U, V, W) cannot be negative, and g(U, V, W) = 0
for U = V = W = 0. Thus a minimum can be found
by inspection, but it is not what we might have hoped for. In
fact, to determine the translational velocity using our
least-squares method we have to solve the following homogeneous
equation for $\mathbf{T}$:

$$G\,\mathbf{T} = 0 \tag{39}$$

where G is the matrix:

$$G = \begin{pmatrix} a & d & f \\ d & b & e \\ f & e & c \end{pmatrix}. \tag{40}$$

Clearly (39) has a solution other than zero if and only if
the determinant of G is zero. Then the three equations
(39) are linearly dependent and $\mathbf{T}$ can be determined up
to a constant factor. In general, however, as the data is
corrupted by noise, g cannot be made equal to zero for
non-zero translational velocity and so $\mathbf{T} = (0, 0, 0)^T$ will be
the only solution to (39). To see this in another way, note
that g has the following form:

$$g(kU, kV, kW) = k^2 g(U, V, W) \tag{41}$$

where k is a constant. Clearly g(U, V, W) assumes its
minimal value for U = V = W = 0.

What we are really interested in is determining the
direction of $\mathbf{T}$ which minimizes g, for a fixed length of $\mathbf{T}$.
Hence we impose the constraint that $\mathbf{T}$ be a unit vector.
If $\mathbf{T}$ is constrained to have unit magnitude, the minimum
value of g is the smallest eigenvalue of the matrix G and
the value of $\mathbf{T}$ for which g assumes its minimum can be
found by determining the eigenvector corresponding to this
eigenvalue [8]. This follows from the observation that g is
a quadratic form which can be written as:

$$g(U, V, W) = \mathbf{T}^T G\,\mathbf{T}. \tag{42}$$
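The constrained minimization can be sketched numerically as follows. This illustration makes several assumptions that are not in the paper: the integrals of (38) are replaced by sums over a list of samples (x, y, u, v); the eigenvector of the smallest eigenvalue is obtained by repeatedly squaring the shifted matrix trace(G) I - G rather than by a closed-form solution of the characteristic polynomial; and the sign is chosen so that the depths given by (23) are positive on balance.

```python
import math

def build_G(samples):
    """Discrete version of the matrix in Eq. (40); samples are (x, y, u, v).
    Each sample contributes c c^T with c . T = u*beta - v*alpha, see Eq. (36)."""
    G = [[0.0] * 3 for _ in range(3)]
    for x, y, u, v in samples:
        c = (v, -u, -(x * v - y * u))
        for i in range(3):
            for j in range(3):
                G[i][j] += c[i] * c[j]
    return G

def smallest_eigenvector(G, squarings=60):
    """Unit eigenvector for the smallest eigenvalue of a symmetric positive
    semidefinite 3x3 matrix, via repeated squaring of trace(G) I - G."""
    s = G[0][0] + G[1][1] + G[2][2]
    if s == 0.0:
        raise ValueError("G is zero: the flow data vanish everywhere")
    B = [[(s if i == j else 0.0) - G[i][j] for j in range(3)] for i in range(3)]
    for _ in range(squarings):
        B = [[sum(B[i][k] * B[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
        scale = max(abs(e) for row in B for e in row)
        B = [[e / scale for e in row] for row in B]
    # B is now close to a multiple of t t^T; take its largest column.
    col = max(range(3), key=lambda j: sum(B[i][j] ** 2 for i in range(3)))
    t = [B[i][col] for i in range(3)]
    n = math.sqrt(sum(e * e for e in t))
    return [e / n for e in t]

def translation_direction(samples):
    """Unit translation direction minimizing g, with the sign for which
    u*alpha + v*beta (the denominator of Eq. (23)) is positive in total."""
    t = smallest_eigenvector(build_G(samples))
    total = sum(u * (-t[0] + x * t[2]) + v * (-t[1] + y * t[2])
                for x, y, u, v in samples)
    return t if total >= 0.0 else [-e for e in t]
```

With noise-free flow generated by a pure translation, the smallest eigenvalue is zero up to rounding and the returned vector is parallel to the true translation.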


Note that G is a positive semidefinite hermitian matrix
as $a \ge 0$, $b \ge 0$, $c \ge 0$, $ab \ge d^2$, $bc \ge e^2$ and $ca \ge f^2$.
(The last three inequalities follow from the
Cauchy-Schwarz inequality [7], [8].) Hence all eigenvalues are real
and non-negative and are the solutions $\lambda$ of the third degree
polynomial:

$$\lambda^3 - (a + b + c)\lambda^2 + (ab + bc + ca - d^2 - e^2 - f^2)\lambda + (ae^2 + bf^2 + cd^2 - abc - 2def) = 0. \tag{43}$$

There  is  an  explicit  formula  for  the  least  positive  root  in 
terms  of  the  real  and  imaginary  parts  of  the  roots  of  the 
quadratic  resolvent  of  the  cubic.  In  our  case  this  gives  us 
the  desired  smallest  root,  since  the  roots  cannot  be  negative. 
For  the  sake  of  completeness,  however,  various  pathologi¬ 
cal  cases  that  might  come  up  will  be  discussed  next,  even 
though  they  are  of  little  practical  interest. 

Note that $\lambda = 0$ is an eigenvalue if and only if G is
singular, that is, if the constant term in the polynomial
(43) equals zero. In fact, if the determinant of G is zero
one can find a velocity $\mathbf{T}$ which makes g zero. It follows
from a theorem in calculus that this happens only when the
optical flow is either not corrupted by noise at all or only
at a few points. The theorem states that if the integral of
the square of a bounded and continuous function is zero
then the function itself is zero. Hence errors can only occur
at points where the optical flow is discontinuous, and these
are exactly the points where the surface defined by Z is
discontinuous. (These are also the places where existing
methods for computing the optical flow [5] are subject to
large errors.)

It is impossible for exactly two eigenvalues to be zero, since this would imply that the coefficient of $\lambda$ in the polynomial (43) equalled zero, while the coefficient of $\lambda^2$ did not. That in turn would imply that $ab = d^2$, $bc = e^2$, and $ca = f^2$, while a, b, and c are not all zero. For equality to hold in the Cauchy-Schwarz inequalities, however, u and v must both be proportional to $xv - yu$. This can only be true (for all x and y in the image) if $u = v = 0$. But then all six integrals become zero and consequently all three eigenvalues are zero. This situation is of little interest, since it occurs only when the optical flow data is zero everywhere. Then the velocity is zero too.

Once the smallest eigenvalue is known, it is straightforward to find the translational velocity which best matches the given data. To determine the eigenvector corresponding to an eigenvalue, say $\lambda_1$, we have to solve the following set of linear equations:

$$\begin{aligned}
(a - \lambda_1)U + dV + fW &= 0\\
dU + (b - \lambda_1)V + eW &= 0\\
fU + eV + (c - \lambda_1)W &= 0.
\end{aligned} \tag{44}$$

As $\lambda_1$ is an eigenvalue, equations (44) are linearly dependent. Let us for a moment assume that all eigenvalues are distinct, that is, the rank of the matrix $(G - \lambda_1 I)$, where $I$ is the identity matrix, is two. Then we can use any pair of them to solve for U, V in terms of W, say. There are three ways of doing this. For numerical accuracy we may add the three results to get the symmetrical forms:

$$\begin{aligned}
U &= (b - \lambda_1)(c - \lambda_1) - f(b - \lambda_1) - d(c - \lambda_1) + e(f + d - e)\\
V &= (c - \lambda_1)(a - \lambda_1) - d(c - \lambda_1) - e(a - \lambda_1) + f(d + e - f)\\
W &= (a - \lambda_1)(b - \lambda_1) - e(a - \lambda_1) - f(b - \lambda_1) + d(e + f - d).
\end{aligned} \tag{45}$$

Note that $\lambda_1$ will be very small if the data is good, and one may wish to just approximate the exact solution by using the above equations with $\lambda_1$ set to zero. (Then there is no need to find the eigenvalue.) In any case, the resulting velocity may now be normalized so that its magnitude equals one. There is one remaining difficulty, arising from the fact that if T is a solution to our minimization problem, so is $-$T. Only one of these solutions will correspond to positive values of Z in equation (34), however. This can be easily seen by evaluating (34) at some point in the image.

The case where the two smallest eigenvalues are the same will be discussed in one of the next paragraphs.

There is a simple geometrical interpretation of what we have done so far. To this end we consider the surface defined by $g(U,V,W) = k$ where k is a constant. Note that we can always find a new coordinate system $\tilde U, \tilde V, \tilde W$ in which $g(U,V,W)$ can be written as:

$$\lambda_1 \tilde U^2 + \lambda_2 \tilde V^2 + \lambda_3 \tilde W^2 = k \tag{46}$$

where $\lambda_i$ for $i = 1, 2, 3$ are the three eigenvalues of the quadratic form. If the eigenvalues are all non-zero, the surface $g(U,V,W) = k$ is an ellipsoid with three orthogonal semi-axes of length $\sqrt{k/\lambda_i}$. We are particularly interested in the case where the constant k is the smallest eigenvalue. Then all three semi-axes have lengths less than or equal to one. Hence the ellipsoid lies within the unit sphere. If the two smallest eigenvalues are distinct, the unit sphere touches the ellipsoid in two places, corresponding to the largest axis. If the two smaller eigenvalues happen to be the same, however, the unit sphere touches the ellipsoid along a circle and as a result all the velocity vectors lying in a plane spanned by two eigenvectors give equally low errors. Finally, if all three eigenvalues are equal, no direction for T is preferred, since the ellipsoid becomes the unit sphere.

The case where exactly one eigenvalue is zero also has a simple geometrical interpretation. The surface defined by $g(U,V,W) = 0$ is a straight line, which can be seen easily from the following equation:

$$\lambda_1 \tilde U^2 + \lambda_2 \tilde V^2 = 0 \tag{47}$$

written for the case when $\lambda_3$ is zero. (Remember that $\lambda_1$ and $\lambda_2$ are both positive.) Clearly the unit sphere intersects this line in exactly two points, one of them corresponding to positive values for Z in equation (34).

The method which we just described can be easily implemented. To this end, the problem can be discretized. An expression similar to (31) can be derived where the integrals are approximated by sums. Our minimization method can then be applied to these sums. The resulting equations are similar to ones described in this section, with summation replacing integration. We implemented the resulting algorithm and tested it using synthetic data including additive noise. The results agreed with our expectations.
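The discretized procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation referred to above: the function names are hypothetical, the matrix G is accumulated from the residual $u\beta - v\alpha$ with $\alpha = -U + xW$ and $\beta = -V + yW$, the smallest root of the cubic (43) is taken from the standard trigonometric formula for symmetric 3 by 3 matrices, and the eigenvector is read from the largest column of the adjugate of $G - \lambda_1 I$ (the symmetrical forms (45) are the sum of these columns). The sign is chosen so that $u\alpha + v\beta$ is positive on average, which is assumed here to be equivalent to positive depth.

```python
import math

def flow_matrix(samples):
    """Return G = sum of r r^T with r = (v, -u, -(x*v - y*u)), so g(T) = T^T G T."""
    G = [[0.0] * 3 for _ in range(3)]
    for x, y, u, v in samples:
        r = (v, -u, -(x * v - y * u))
        for i in range(3):
            for j in range(3):
                G[i][j] += r[i] * r[j]
    return G

def smallest_eigenvalue(G):
    """Smallest root of the characteristic cubic of a symmetric 3x3 matrix."""
    a, b, c = G[0][0], G[1][1], G[2][2]
    d, e, f = G[0][1], G[1][2], G[0][2]
    q = (a + b + c) / 3.0
    p2 = (a - q) ** 2 + (b - q) ** 2 + (c - q) ** 2 + 2.0 * (d * d + e * e + f * f)
    if p2 == 0.0:
        return q  # G is a multiple of the identity
    p = math.sqrt(p2 / 6.0)
    A, B, C = (a - q) / p, (b - q) / p, (c - q) / p
    D, E, F = d / p, e / p, f / p
    r = (A * (B * C - E * E) - D * (D * C - E * F) + F * (D * E - B * F)) / 2.0
    r = max(-1.0, min(1.0, r))  # guard acos against rounding
    phi = math.acos(r) / 3.0
    return q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)

def eigenvector(G, lam):
    """Unit eigenvector for eigenvalue lam, from the adjugate of (G - lam I)."""
    A, B, C = G[0][0] - lam, G[1][1] - lam, G[2][2] - lam
    d, e, f = G[0][1], G[1][2], G[0][2]
    cols = [
        (B * C - e * e, e * f - d * C, d * e - B * f),
        (e * f - d * C, A * C - f * f, d * f - A * e),
        (d * e - B * f, d * f - A * e, A * B - d * d),
    ]
    best = max(cols, key=lambda col: sum(t * t for t in col))
    norm = math.sqrt(sum(t * t for t in best))
    if norm == 0.0:
        raise ValueError("repeated eigenvalue: direction is not unique")
    return tuple(t / norm for t in best)

def translation_direction(samples):
    """Unit translation direction from (x, y, u, v) flow samples."""
    G = flow_matrix(samples)
    T = eigenvector(G, smallest_eigenvalue(G))
    # Of T and -T, keep the one for which u*alpha + v*beta is positive on average.
    s = sum(u * (-T[0] + x * T[2]) + v * (-T[1] + y * T[2]) for x, y, u, v in samples)
    return T if s > 0 else tuple(-t for t in T)
```

Picking the largest adjugate column rather than summing the three is a design choice made for this sketch: the sum vanishes whenever the components of the eigenvector add up to zero.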

One can use the ratio of the biggest to the smallest eigenvalue, the so-called condition number [15], as a measure of confidence in the computed velocity. The result is very sensitive to errors in the measurements unless this ratio is much bigger than one.

Curiously, the same error integral as (35) is obtained in the case where the $ML_{Zuv}$ norm is used:

$$\|f(x,y)\|_{Zuv} = \int_{-h}^{h}\int_{-w}^{w} [f(x,y)\,Z(x,y)]^2\,(u^2 + v^2)\,dx\,dy. \tag{48}$$

We can arrive at a similar solution by multiplying the integrand in (17) by $Z^2$ instead of by $\alpha^2 + \beta^2$. In that case the minimization is carried out with respect to the $ML_Z$ norm defined by:

$$\|f(x,y)\|_{Z} = \int_{-h}^{h}\int_{-w}^{w} [f(x,y)\,Z(x,y)]^2\,dx\,dy. \tag{49}$$

Here optical flow velocities for points which are further away are weighted more heavily. This is most appropriate when the measurement of larger velocities is less accurate.

We end up with a quadratic form similar to g, only the integrals for the six constants corresponding to a, b, c, d, e, and f are a bit more complicated. Curiously they only depend on the direction of the optical flow at each point, not its magnitude.

Also, other constraints could be used. If we insist on $U^2 + V^2 = 1$, for example, we obtain a quadratic instead of a cubic equation, and if we use $W = 1$, only a linear equation needs to be solved. The disadvantage of these approaches is that the result is sensitive to the orientation of the coordinate axes. Clearly, in the case of exact data, we get the right solution using any of the constraints mentioned above.

3.4. Using a Different Constraint

The minimization scheme discussed in the previous section gives us a unique solution in most cases for the velocity vector T. Here we propose a slightly different approach which always gives us a unique solution. Note that applying the first step in our minimization method gives us a constraint between the values of Z, the velocity vector and the optical flow at every point. We can in addition assume that $Z = Z_0$ is known at a particular point, say $(x_0, y_0)$. Using the $ML_{Zuv}$ norm in our scheme, we want to minimize:

$$\int_{-h}^{h}\int_{-w}^{w} \left\{[uZ - (-U + xW)]^2 + [vZ - (-V + yW)]^2\right\} (u^2 + v^2)\,dx\,dy. \tag{50}$$

Differentiating (50) with respect to Z, and setting the resulting expression equal to zero, we obtain:

$$Z = \frac{u\alpha + v\beta}{u^2 + v^2}. \tag{51}$$

Thus we propose to solve the minimization scheme under the following constraint:

$$Z_0(u_0^2 + v_0^2) - (u_0\alpha_0 + v_0\beta_0) = 0 \tag{52}$$

where $u_0$ and $v_0$ denote the components of the optical flow measured at $(x_0, y_0)$ and $\alpha_0$ and $\beta_0$ denote $\alpha$ and $\beta$ evaluated at $(x_0, y_0)$. The error integral (50) becomes after substituting (51):

$$\int_{-h}^{h}\int_{-w}^{w} (u\beta - v\alpha)^2\,dx\,dy \tag{53}$$

which is the same as (35) and is denoted by $g(U,V,W)$ (37). Thus we want to minimize:

$$g(U,V,W) + 2\lambda\left[Z_0(u_0^2 + v_0^2) - (u_0\alpha_0 + v_0\beta_0)\right] \tag{54}$$

where $\lambda$ denotes a Lagrangian multiplier. To determine U, V and W the following linear equations, obtained by differentiating (54) with respect to U, V, W and $\lambda$, have to be solved:

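$$\begin{aligned}
aU + dV + fW + \lambda u_0 &= 0\\
dU + bV + eW + \lambda v_0 &= 0\\
fU + eV + cW - \lambda(x_0u_0 + y_0v_0) &= 0\\
u_0U + v_0V - (x_0u_0 + y_0v_0)W &= -Z_0(u_0^2 + v_0^2).
\end{aligned} \tag{55}$$

These equations can be written in the form:

$$G_\lambda \mathbf{T}_\lambda = \mathbf{P} \tag{56}$$

where $\mathbf{T}_\lambda = (U, V, W, \lambda)^T$ and $\mathbf{P} = (0, 0, 0, -Z_0(u_0^2 + v_0^2))^T$. Let the determinant of $G_\lambda$ be $\Delta_0$:

$$\begin{aligned}
\Delta_0 ={}& (d^2 - ab)(x_0u_0 + y_0v_0)^2 + (e^2 - bc)u_0^2 + (f^2 - ac)v_0^2\\
&+ 2\left[(de - bf)u_0(x_0u_0 + y_0v_0) + (df - ae)v_0(x_0u_0 + y_0v_0) + (cd - ef)u_0v_0\right].
\end{aligned} \tag{57}$$

Assuming that $\Delta_0 \neq 0$ we can easily determine $\mathbf{T}_\lambda$ from (55):

$$\mathbf{T}_\lambda = G_\lambda^{-1}\mathbf{P}. \tag{58}$$

Introducing the following abbreviation:

$$K = \frac{Z_0(u_0^2 + v_0^2)}{\Delta_0} \tag{59}$$

we can give these formulae for $\mathbf{T}_\lambda$:

$$\begin{aligned}
U &= K\left[u_0(bc - e^2) + v_0(ef - cd) + (x_0u_0 + y_0v_0)(bf - de)\right]\\
V &= K\left[u_0(ef - cd) + v_0(ac - f^2) + (x_0u_0 + y_0v_0)(ae - df)\right]\\
W &= K\left[u_0(de - bf) + v_0(df - ae) + (x_0u_0 + y_0v_0)(d^2 - ab)\right]\\
\lambda &= K\left[ae^2 + cd^2 + bf^2 - abc - 2def\right].
\end{aligned} \tag{60}$$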

The disadvantage of this approach is that the result depends upon the values of the optical flow at a single point. To circumvent this problem we propose to determine average values for U, V and W in the following manner. First note that we are only interested in the ratios U/W and V/W, which obviously do not depend upon the (unknown) value for $Z_0$. Equivalently we could determine the value for K from the condition that T should have unit length. Hence we can determine values for U, V and W which depend only upon the values of the optical flow at a single point and the coefficients in the matrix G. If we want to remove the dependence of the result on the data at a single point, we can simply average the values obtained for all image points.

In the case where the data is exact, i.e., where the determinant of G, denoted by det G, vanishes, the solution for T is the same one as obtained using no constraint in our minimization scheme. To see this just observe that in that case $\lambda = 0$, as $\lambda = -K \det G$. In the case where $\Delta_0 = 0$, equations (55) have a solution only when $\det G = 0$. We do not have to be concerned with the case where $\Delta_0 = 0$ but $\det G \neq 0$, as we can argue that equations (55) always have to have a solution. Note that our method is based on the condition that Z is a certain function (51) of U, V, W. Hence (52) cannot impose a constraint which would be impossible to satisfy.

The methods discussed in this section have been applied to noisy synthetic data with the expected results.
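As a rough numerical check of the closed-form solution, the sketch below evaluates $\Delta_0$, K and the formulae for U, V, W and $\lambda$ directly. It is an illustration only: the function names are hypothetical, and G is assumed to be accumulated from the residual $u\beta - v\alpha$ with $\alpha = -U + xW$ and $\beta = -V + yW$, so that its entries play the roles of a, b, c, d, e and f.

```python
def flow_matrix(samples):
    """Return G = sum of r r^T with r = (v, -u, -(x*v - y*u)), so g(T) = T^T G T."""
    G = [[0.0] * 3 for _ in range(3)]
    for x, y, u, v in samples:
        r = (v, -u, -(x * v - y * u))
        for i in range(3):
            for j in range(3):
                G[i][j] += r[i] * r[j]
    return G

def constrained_velocity(G, x0, y0, u0, v0, Z0):
    """Solve the constrained problem in closed form; returns ((U, V, W), lam)."""
    a, b, c = G[0][0], G[1][1], G[2][2]
    d, e, f = G[0][1], G[1][2], G[0][2]
    s = x0 * u0 + y0 * v0
    # Determinant of the bordered 4x4 system.
    delta0 = ((d * d - a * b) * s * s + (e * e - b * c) * u0 * u0
              + (f * f - a * c) * v0 * v0
              + 2.0 * ((d * e - b * f) * u0 * s + (d * f - a * e) * v0 * s
                       + (c * d - e * f) * u0 * v0))
    if delta0 == 0.0:
        raise ValueError("singular system: the reference point gives no constraint")
    K = Z0 * (u0 * u0 + v0 * v0) / delta0
    U = K * (u0 * (b * c - e * e) + v0 * (e * f - c * d) + s * (b * f - d * e))
    V = K * (u0 * (e * f - c * d) + v0 * (a * c - f * f) + s * (a * e - d * f))
    W = K * (u0 * (d * e - b * f) + v0 * (d * f - a * e) + s * (d * d - a * b))
    lam = K * (a * e * e + c * d * d + b * f * f - a * b * c - 2.0 * d * e * f)
    return (U, V, W), lam
```

With exact data and the true depth at the reference point, the multiplier should come out as zero up to rounding and the returned vector should equal the true velocity, including its magnitude, since the constraint fixes the scale.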

4.  Rotational  Case 

Suppose now that the motion of the camera is purely rotational. In order to determine the motion from optical flow we again use a least-squares algorithm with the $ML_2$ norm described in the previous section. Recall that in this case the optical flow is (see (8)):

$$\begin{aligned}
u_r &= Axy - B(x^2 + 1) + Cy\\
v_r &= A(y^2 + 1) - Bxy - Cx.
\end{aligned} \tag{61}$$

We will show now in an analogous fashion to section 3.1 that two different rotations, say $\boldsymbol{\omega}_1 = (A_1, B_1, C_1)^T$ and $\boldsymbol{\omega}_2 = (A_2, B_2, C_2)^T$, cannot generate the same optical flow. Assuming the converse, the following equations have to hold for all values of x and y:

$$\begin{aligned}
A_1xy - B_1(x^2 + 1) + C_1y &= A_2xy - B_2(x^2 + 1) + C_2y\\
A_1(y^2 + 1) - B_1xy - C_1x &= A_2(y^2 + 1) - B_2xy - C_2x
\end{aligned} \tag{62}$$

from which we can immediately deduce that $\boldsymbol{\omega}_1 = \boldsymbol{\omega}_2$.

In general, the direction of the optical flow at two points and its magnitude at one point determine the motion of a camera in pure rotation uniquely. We choose instead to minimize the following expression:

$$\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)^2 + (v - v_r)^2\right] dx\,dy. \tag{63}$$

As the motion is purely rotational, the optical flow does not depend upon the distance to the surface and therefore we may omit the first step in our method. Thus we immediately differentiate (63) with respect to A, B and C and set the resulting expressions equal to zero:

$$\begin{aligned}
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)xy + (v - v_r)(y^2 + 1)\right] dx\,dy &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)(x^2 + 1) + (v - v_r)xy\right] dx\,dy &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)y - (v - v_r)x\right] dx\,dy &= 0.
\end{aligned} \tag{64}$$

Rewriting the above equations we obtain:

$$\begin{aligned}
\int_{-h}^{h}\int_{-w}^{w} \left[uxy + v(y^2 + 1)\right] dx\,dy &= \int_{-h}^{h}\int_{-w}^{w} \left[u_rxy + v_r(y^2 + 1)\right] dx\,dy\\
\int_{-h}^{h}\int_{-w}^{w} \left[u(x^2 + 1) + vxy\right] dx\,dy &= \int_{-h}^{h}\int_{-w}^{w} \left[u_r(x^2 + 1) + v_rxy\right] dx\,dy\\
\int_{-h}^{h}\int_{-w}^{w} \left[uy - vx\right] dx\,dy &= \int_{-h}^{h}\int_{-w}^{w} \left[u_ry - v_rx\right] dx\,dy
\end{aligned} \tag{65}$$

and expanding these equations yields:

$$\begin{aligned}
aA + dB + fC &= k\\
dA + bB + eC &= l\\
fA + eB + cC &= m,
\end{aligned} \tag{66}$$

where:

$$\begin{aligned}
a &= \int_{-h}^{h}\int_{-w}^{w} \left[x^2y^2 + (y^2 + 1)^2\right] dx\,dy\\
b &= \int_{-h}^{h}\int_{-w}^{w} \left[(x^2 + 1)^2 + x^2y^2\right] dx\,dy\\
c &= \int_{-h}^{h}\int_{-w}^{w} (x^2 + y^2)\,dx\,dy\\
d &= -\int_{-h}^{h}\int_{-w}^{w} \left[xy(x^2 + y^2 + 2)\right] dx\,dy\\
e &= -\int_{-h}^{h}\int_{-w}^{w} y\,dx\,dy\\
f &= -\int_{-h}^{h}\int_{-w}^{w} x\,dx\,dy
\end{aligned} \tag{67}$$

and:

$$\begin{aligned}
k &= \int_{-h}^{h}\int_{-w}^{w} \left[uxy + v(y^2 + 1)\right] dx\,dy\\
l &= -\int_{-h}^{h}\int_{-w}^{w} \left[u(x^2 + 1) + vxy\right] dx\,dy\\
m &= \int_{-h}^{h}\int_{-w}^{w} (uy - vx)\,dx\,dy.
\end{aligned} \tag{68}$$

If we call the coefficient matrix in (66) M and the column vector on the right-hand side $\mathbf{n}$, then we have:

$$M\boldsymbol{\omega} = \mathbf{n}. \tag{69}$$

Thus, provided the matrix M is non-singular, we can compute the rotation as follows:

$$\boldsymbol{\omega} = M^{-1}\mathbf{n}. \tag{70}$$

It is easy to see that the matrix M is non-singular in the special case of a rectangular image plane since then:

$$d = e = f = 0. \tag{71}$$

So in the case of a rectangular image plane, the matrix is diagonal, which makes it particularly easy to compute its inverse. In fact, the matrix is diagonal if the image plane is symmetrical about the x-axis and the y-axis. As the extent of the image plane decreases, however, the matrix M becomes ill conditioned. That is, inaccuracies in the three integrals (k, l, and m) computed from the observed flow are greatly magnified. This makes sense since we cannot expect to accurately determine the component of rotation about the optical axis when observations are confined to a small cone of directions about the optical axis.

Again, in our numerical implementation of the algorithm the integrals in (67) can be approximated by sums. The methods discussed in this section have been applied to noisy synthetic data with the expected results.


5. General Motion


We would like now to apply a least-squares algorithm to determine the motion of a camera from optical flow given no a priori assumptions about the motion. It is plain that a least-squares method is easiest to use when the resulting equations are linear in all the motion parameters. Unfortunately, there exists no norm which will allow us to achieve this goal. We did find a norm, however, which resulted in equations that are linear in some of the unknowns and quadratic in the others. We again attack the minimization problem using the $ML_{\alpha\beta}$ norm under the constraint that $U^2 + V^2 + W^2 = 1$. The ensuing equations are polynomials in the unknowns U, V, W, A, B and C and can be solved by a standard iteration method like Newton's method or Bairstow's method [14] or by an interpolation scheme like regula falsi [14]. The expression we wish to minimize is:

$$\int_{-h}^{h}\int_{-w}^{w} \left[\left(u - u_r - \frac{\alpha}{Z}\right)^2 + \left(v - v_r - \frac{\beta}{Z}\right)^2\right] (\alpha^2 + \beta^2)\,dx\,dy. \tag{72}$$

The first step is to differentiate the integrand of (72) with respect to Z and set the resulting expression equal to zero:

$$Z = \frac{\alpha^2 + \beta^2}{(u - u_r)\alpha + (v - v_r)\beta}. \tag{73}$$


We introduce the Lagrangian multiplier $\lambda$ as before and attempt to minimize:

$$\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right]^2 dx\,dy + \lambda(U^2 + V^2 + W^2 - 1). \tag{74}$$

The equations we have to solve to determine the motion parameters are obtained by differentiation:

$$\begin{aligned}
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right]\left[-xy\beta + (y^2 + 1)\alpha\right] dx\,dy &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right]\left[(x^2 + 1)\beta - xy\alpha\right] dx\,dy &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right]\left[y\beta + x\alpha\right] dx\,dy &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right](v - v_r)\,dx\,dy + \lambda U &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right](u - u_r)\,dx\,dy - \lambda V &= 0\\
\int_{-h}^{h}\int_{-w}^{w} \left[(u - u_r)\beta - (v - v_r)\alpha\right]\left[(u - u_r)y - (v - v_r)x\right] dx\,dy + \lambda W &= 0\\
U^2 + V^2 + W^2 &= 1.
\end{aligned} \tag{75}$$

Note that the first three of these equations are linear in A, B and C, from which these parameters can be determined uniquely in terms of U, V and W. Then we can determine U, V and W from the last four equations by a numerical method. To this end, the problem can be discretized and equations analogous to (75) derived, where summation of the appropriate expressions is used instead of integration.

6. Summary

Our objective was to devise a method for determining the motion of a camera from optical flow which allows for noise in the measured data. The least-squares method which we proposed in this paper meets our goal and is also suitable for numerical implementation. An important application of our results is in passive navigation. Here the path and instantaneous altitude of a vehicle are to be determined from information gleaned about the environment without the emission of sampling radiation from the vehicle.

Abstract

The past few years have seen a growing interest in the application of three-dimensional image processing. With the increasing demand for 3-D spatial information for tasks of passive navigation [Gennery 1980], [Moravec 1980], automatic surveillance [Henderson 1979], aerial cartography [Kelly 1977], [Panton 1978], and inspection in industrial automation, the importance of effective stereo analysis has been made quite clear. A particular challenge is to provide reliable and accurate depth data for input to object or terrain modelling systems (such as [Brooks 1981b]). This paper describes an algorithm for such stereo sensing. It uses an edge-based line-by-line stereo correspondence scheme, and appears to be fast, robust, and parallel implementable. The processing consists of extracting edge descriptions for a stereo pair of images, linking these edges to their nearest neighbors to obtain the edge connectivity structure, correlating the edge descriptions on the basis of local edge properties, then cooperatively removing those edge correspondences determined to be in error, namely those which violate the connectivity structure of the two images. A further correspondence process, using a technique similar to that used for the edges, is applied to the image intensity values over intervals defined by the edge correspondence. The result of the processing is a full image array disparity map of the scene viewed.


Area-based versus Edge-based Analysis


The development of commercial stereo mapping systems has been driven by requirements of terrestrial cartography. Cartography traditionally concentrated on mapping of terrain, producing elevation contour maps and digital terrain data bases. The bulk of stereophotogrammetry is accomplished manually by direct human compilation. There is growing use of recently developed interactive systems with automated stereo correlation capabilities, although these systems require extensive operator intervention (see [Friedman 1980]). All such systems use area-based cross-correlation. To the extent that cross-correlation is limited, they all have similar limitations. These limitations are in mapping accuracy and applicability to various terrain types (see [Ryan 1980] for a discussion of the errors and difficulties inherent in window shaping and cross-correlation).

Area-based stereo matching uses windowing mechanisms to isolate parts of two images for cross-correlation. Edge-based stereo matching uses two dimensional convolution operators to reduce an image to a depiction of its intensity boundaries, which can then be put into correspondence. Area-based cross-correlation techniques require distinctive texture within the area of correlation for successful operation. They break track:

•  where there are ambiguous textures or featureless areas (roofs, sand and concrete);

•  where the correlation area crosses surface discontinuities (at occlusions such as buildings, or thin objects (poles));

•  where depth is ill-defined (such as through trees).

In general, these systems break track where there is no local correlation (zero signal, or where two images do not correspond) or where the correlation is ambiguous (where the signal is repetitive). The systems must be started manually and corrected when they break track.

Edge-based analysis appears to provide a solution to many of the problems of correlation. The system described here deals with edges, in its initial analysis, because of the:

a)  reduced combinatorics: there are fewer edges than pixels,

b)  greater accuracy: edges can be positioned to sub-pixel precision, while area positioning precision is inversely proportional to window size, and considerably poorer, and

c)  more realistic invariance assumptions: area-based analysis presupposes that the photometric properties of a scene are invariant to viewing position, while edge-based analysis works with the assumption that it is the geometric properties that are invariant to viewing position.

Method  and  Constraint 

In the system to be described, edges are found in images by a simple convolution operator. They are located at positions in the image where a change in sign of second difference in intensity occurs. A particular operator, 1 by 7 pixels in size¹, measures the directional first difference in intensity at each pixel. Second differences are computed from these, and changes in sign of these second differences are used to interpolate zero crossings (i.e. peaks in first difference). Certain local properties other than position are measured and associated with each edge (contrast, orientation, and intensity to either side) and links are kept to nearest neighbours above, below, and to the sides. It is these properties that define an edge and provide the basis for the correspondence process (see the discussions of edge matching in [Arnold 1980], [Baker 1980]).
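The edge-finding steps above can be sketched for a single image row as follows. This is a hedged illustration rather than the operator used in the system: the weights of the 1 by 7 operator are not given in the text, so a simple box difference (three pixels to the right minus three to the left) is assumed, and the function names and the `threshold` parameter are hypothetical.

```python
def first_difference(row, half=3):
    """Directional first difference with an assumed 1 x (2*half + 1) box operator."""
    n = len(row)
    d1 = [0.0] * n
    for i in range(half, n - half):
        d1[i] = sum(row[i + 1:i + 1 + half]) - sum(row[i - half:i])
    return d1

def find_edges(row, half=3, threshold=0.0):
    """Return (position, contrast) pairs at sub-pixel peaks of the first difference."""
    n = len(row)
    d1 = first_difference(row, half)
    # d2[i] is the second difference, located midway between pixels i and i + 1.
    d2 = [0.0] * n
    for i in range(half, n - half - 1):
        d2[i] = d1[i + 1] - d1[i]
    edges = []
    for i in range(half, n - half - 2):
        rising = d2[i] > 0.0 >= d2[i + 1] and d1[i + 1] > threshold
        falling = d2[i] < 0.0 <= d2[i + 1] and d1[i + 1] < -threshold
        if rising or falling:
            # Linear interpolation of the zero crossing between i + 0.5 and i + 1.5.
            pos = i + 0.5 + d2[i] / (d2[i] - d2[i + 1])
            edges.append((pos, d1[i + 1]))
    return edges
```

For an ideal step between pixels 9 and 10 the sketch reports a single edge at position 9.5, midway between the two pixels, which illustrates the sub-pixel positioning the text refers to.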

¹The edge operator is simple, basically one dimensional, and is noteworthy only in that it is fast and fairly effective.

A stereo pair of images (from Control Data Corporation)
Figure 1


The correspondence is a search for edge matches across images. Figure 2 shows the edges found in the two images of Figure 1 with the second difference operator (note: all stereo pairs in this paper are drawn for cross-eyed viewing). Although the operator works in both horizontal and vertical directions, it only allows matching on edges whose horizontal gradient lies above the noise level (one standard deviation of the first difference in intensity). With no prior knowledge of the viewing situation, one could have any edge in one image matching any edge in the other. The depictions of Figure 2 have about 3500 edges each; the combinatorics of a naive matching strategy here would clearly be enormous.


Figure 2

By constraining the geometry of the cameras during picture taking one can vastly limit the computation that is required in determining corresponding edges in the two images. If two equivalent cameras are arranged with axes parallel, as shown in Figure 3, then they can be conceived of as sharing a single common image plane. Any point in the scene will project to two points on that joint image plane (one through each of the two lens centers), the connection of which will produce a line parallel to the baseline between the cameras. Thus corresponding edges in the two images must lie along the same line in the joint image plane. This line is termed an epipolar line (see [Hallert 1960]). If the baseline between the two cameras happens to be parallel to the scanning axis of the cameras, then the correspondence only need consider edges lying along matched lines parallel to that axis in the two images. These lines are termed conjugate. Figure 3 indicates this camera geometry, a geometry which produces registered images. The algorithm described assumes the stereo pair to be registered, and if this is not the case then the appropriate transformation of one image relative to the other must be made before further processing is done. Note that a less restrictive solution would be to have the correspondence process informed of the camera geometries, and have it solve for the more general epipolar situation.


Specialized Epipolar Geometry
Figure 3

Figure 5 shows a pair of conjugate lines from the stereo pair of Figure 1, while Figure 4 plots the actual intensities of these lines, seen as dots, superimposed on the edges determined by the edge operator. Since one needs to compare edges only from conjugate lines in the two images, the matching process can be applied to the edges shown in Figure 4. The correspondence could proceed at this point by searching for the 'best' assignment of edges along such conjugate lines, that is, an assignment which optimizes some goodness measure. However, normal combinatoric search is quite inadequate here. Typical lines have upwards of n = 30 edges each and the combinatoric space, with a naive upper bound of n!, grows rapidly with n (the CDC imagery above is not typical; rather it is synthetic, and actually quite noise-free; in contrast see the images of Figure 10). Even with extensive heuristic pruning, runs with a combinatoric search approach often take seconds per line, and sometimes minutes (on a DEC KL10). A superior approach to the correspondence task lies in using the Viterbi algorithm [Forney 1973], a dynamic programming technique used extensively in speech processing, and first used in vision research in some recent work at Control Data Corporation [Henderson 1979]. An earlier use of a similar dynamic programming technique for stereo matching is documented in [Gimel'farb 1972].

Edges of these lines with intensities marked
Figure 4


Right and Left Image Conjugate Epipolar Line Intensities
Figure 5

What distinguishes the Viterbi technique from normal search is the requirement that one be able to partition the original problem into two subproblems, each of which can be solved optimally and whose results can be processed to yield a global optimum for the original problem ('optimal' with respect to an evaluation function on the chosen parameters). In a recursive way, each of the subproblems may be divided and the solution process repeated. Geometrically, the partitioning constraint here is one of monotonicity of edge order. A left-right ordering of edges in one image cannot correspond to a right-left ordering in the other; i.e. there can be no positional reversals of edges in the image plane. This constraint allows one to make n tentative assignments of an edge on one line with the edges of the conjugate line in the other image, with each tentative assignment partitioning the correspondence problem into two subproblems. The two subproblems are the matching of edges lying to the left and right of the selected edge on the one line with edges lying to the left and right respectively of its tentative match on the other line. The optimality criterion selects the series of such assignments judged 'best'. This constraint excludes from analysis, for the time being, features such as wires or overhanging surfaces which lead to positional reversals in the image. This reversal also causes the human vision system trouble: we can fuse one or the other, the nearer or the farther, but not both at the same time (see [Burt 1980]).
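The role of the monotonicity constraint can be illustrated with a small dynamic program over the edges of two conjugate lines. This sketch is not the Viterbi formulation used in the system: the cost function (difference in edge contrast), the fixed `skip_cost` charged for an unmatched edge, and all names are assumptions made for the example. What it does share with the text is the partitioning idea: because edge order is preserved, the best matching of the first i left edges with the first j right edges depends only on smaller subproblems.

```python
def match_edges(left, right, skip_cost=5.0, cost=lambda a, b: abs(a - b)):
    """Order-preserving matching of two edge lists; returns (pairs, total_cost)."""
    n, m = len(left), len(right)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = i * skip_cost, "L"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = j * skip_cost, "R"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + cost(left[i - 1], right[j - 1]), "M"),  # match
                (best[i - 1][j] + skip_cost, "L"),  # left edge left unmatched
                (best[i][j - 1] + skip_cost, "R"),  # right edge left unmatched
            ]
            best[i][j], back[i][j] = min(options)
    # Trace the optimal path back to recover the matched index pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "M":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "L":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs, best[n][m]
```

The table has (n + 1)(m + 1) entries, each filled in constant time, so the work grows with the product of the edge counts rather than with n!.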

The correspondence process could use the edges as indicated
in Figure 2 above, but in the interests of robustness and efficiency
a different approach is taken here. A large amount of small-scale
detail in the images will increase the cost of the correlation
exponentially. Reducing the level of detail and narrowing the
extent of the required search will reduce the computation time and
enhance noise immunity. This is achieved through the use of a
coarse-to-fine analysis in which a reduced resolution matching
process is first applied to bring the two images into rough
correspondence. The removal of the small-scale detail greatly
reduces the number of edges to be dealt with. Successive
refinements in resolution bring successively finer detail into the
analysis, and each such phase can use the results of the previous
lower resolution analysis to narrow its search. Such an approach
has had previous successful application in visual processing (e.g.
[Marr 1977], [Grimson 1980], [Moravec 1980]), and has relevant
ties to the neurophysiology of vision, where it is felt a multiple
spatial frequency analysis is part of the human system’s processing
[Wilson 1978] (although the filtering used here is low pass,
and not bandpass). It was our intent to use the low resolution
components of the images to determine local approximate
disparities, and to use these as guides for the full resolution
matching. To obtain the resolution reduction we use a linear
smoothing filter to successively halve image resolutions, continuing until
the image noise content reaches an acceptably low point, one
brightness level (smoothing reduces noise, and so increases the
signal-to-noise ratio). Figure 6 shows the edges in successive resolution
reductions of a sample line pair from the images of Figure 10,
again with dots marking the intensities.
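
The resolution reduction can be illustrated on a single scanline. This is a sketch under stated assumptions: the paper says only that a linear smoothing filter is used, so the [1, 2, 1]/4 kernel, the border handling, and the fixed number of levels (instead of stopping at a noise level of one brightness step) are all choices made here for illustration.

```python
def halve_resolution(line):
    """Smooth a scanline with a [1, 2, 1]/4 kernel, then keep every second sample."""
    n = len(line)
    smoothed = []
    for i in range(n):
        left = line[max(i - 1, 0)]        # replicate the border samples
        right = line[min(i + 1, n - 1)]
        smoothed.append((left + 2 * line[i] + right) / 4.0)
    return smoothed[::2]


def resolution_pyramid(line, levels):
    """Return [full, half, quarter, ...] versions of a scanline."""
    pyramid = [list(line)]
    for _ in range(levels):
        pyramid.append(halve_resolution(pyramid[-1]))
    return pyramid
```

Matching would start on the last (coarsest) entry of the pyramid and use the disparities found there to narrow the search at each finer level.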


Right and Left Image Epipolar Line
Successive Resolution Reductions
Figure 6

The same basic second difference operator is used throughout
the resolution reduction analyses, but its size and noise-based
thresholds are altered to keep it matched to the characteristics
of the ‘new’ reduced resolution image. These lowest resolution
edges are matched in a manner to be described below. The
intervals specified between nearest corresponding edge pairs and
their mates in the other image define local disparities to be used
by the full resolution correspondence process.

The Edge Matching Process

The process of determining edge correspondences is basically
the same for both the reduced resolution and the full resolution
processes; the only difference is in the set of parameters used by
the optimization function. Full resolution matching uses edge
orientation, side intensities, relative disparity (as measured by
the reduced resolution phase), and interval compression implied
by the correspondence (which uses edge position to determine the
foreshortening of scene surfaces, analogous to human spatial
frequency processing, as in [Blakemore 1970]). In reduced resolution
matching, side intensities, contrast, and interval compression
are used. These parameter measures all enter the computation
as probabilistic weightings: 0 < P < 1.

Each edge from a line in the reference image is associated with
a set of possible matching edges from the conjugate line in the
other image (including the null match, which would imply that
the edge is either spurious or is obscured in the other image).
Slightly complicating this, each such edge is treated as a doublet,
being a left side (the termination of the interval to its left) and
a right side (the start of the interval to its right); left sides of
edges can only match left sides of edges, right sides only right
sides (quite obviously). This left-right distinction is essential in
domains where surfaces may occlude one another, leaving a surface
to one side of an edge hidden from one viewpoint while it is
visible to the other. Each side of an edge is termed a half-edge.
For each pairing in the set of possible matches the static probability
of correspondence is determined: this is the product of
all of the mentioned probability measures except interval compression
(which is determined dynamically). The optimization
process then uses these probabilities, composing them with the
dynamic interval compression probabilities in determining the
‘best’ correspondence of half-edges along a line in one image
with half-edges along the corresponding line in the other image
(details in [Baker 1981]). This computation is O(n^3) for n edges
along either image line; it would be O(n^2) were it not for the
use of interval compression probabilities.
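
The flavour of the optimization can be conveyed by a simplified dynamic program. The sketch below is not the authors' matcher: it leaves out the dynamic interval compression term and the half-edge doublets, so it corresponds to the cheaper O(n^2) form mentioned in the text. The names `match_edges`, `static_prob` and `null_prob` are hypothetical, and `static_prob` is assumed to return the product of the static measures as a value in (0, 1].

```python
import math


def match_edges(left, right, static_prob, null_prob=0.1):
    """Order-preserving matching of two edge lists by dynamic programming.

    static_prob(a, b) returns a weighting in (0, 1]; null_prob is the
    assumed weighting for leaving an edge unmatched. Returns a list of
    (i, j) index pairs, with None marking a null match.
    """
    n, m = len(left), len(right)
    skip = math.log(null_prob)
    # best[i][j]: best log-probability for left[:i] against right[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = best[i - 1][0] + skip, "left_null"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = best[0][j - 1] + skip, "right_null"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = math.log(static_prob(left[i - 1], right[j - 1]))
            options = [
                (best[i - 1][j - 1] + pair, "match"),
                (best[i - 1][j] + skip, "left_null"),
                (best[i][j - 1] + skip, "right_null"),
            ]
            best[i][j], back[i][j] = max(options)
    # Trace back from the end to recover the chosen pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif move == "left_null":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]
```

Because each step only ever advances along both lines, the recovered pairing cannot contain a positional reversal, which is how the ordering constraint enters the computation.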

The  Connectivity  Constraint 

Figure 7 shows the results of the whole image line-by-line correspondence
process. Wherever there is a noticeable horizontal
jag across the image, there is an error in the matching. What
is really being depicted here is change in disparity along connected
edges in each image. This is achieved by plotting between
the connected edges of an image, but rather than using
just each edge’s coordinate, using its coordinate plus associated
disparity. Thus, when a connected stretch of edges in one image
is matched to various parts of the other image, the drawing will
jump horizontally back and forth in the other image’s space,
touching the various parts matched with the connected stretch.
At these horizontal jumps, the process is suggesting that there is
a large change in depth. This is a suggestion of a break in depth
continuity.


Preliminary  Correlation  results 
Figure  7 

Let us emphasize: the dynamic programming algorithm above
performs a local optimization for the correlation of individual
lines in the image; it uses no information outside of those
lines. A very strong global constraint is apparent and available
here, that of edge connectivity. It may be presumed (by general
position) that, in the absence of other information, a connected
sequence of edges in one image should be seen as a connected
sequence of edges in the other, and that the structure in the scene
underlying these observations may be inferred to be a continuous
surface detail or a continuous surface contour. A cooperative
procedure uses this connectivity assumption to remove edge
correspondences which violate surface continuity. The evidence for
these mismatches is found through the tracking of disparities
along connected edges on adjacent lines (this is what Figure 7
depicts). The results of the matching after this process has been
applied are shown in Figure 8 (with the same type of depiction as
Figure 7). Figure 9 shows a perspective view of these connected
edge elements in depth. Notice that this figure, drawn by the
system, shows that it has truly captured something of the 3-D
structure of the scene.
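
The idea of tracking disparity along a connected edge can be sketched as a simple outlier test. This is a stand-in for illustration only and not the cooperative procedure itself, whose details are not given here: it assumes one disparity per scanline along a single connected edge and flags any match that departs from the median of its neighbours by more than a tolerance. The window size and tolerance are arbitrary choices.

```python
from statistics import median


def flag_discontinuous_matches(disparities, tolerance=2.0):
    """Mark matches whose disparity breaks continuity along a connected edge.

    `disparities` holds one disparity per scanline along a single connected
    edge (None where the edge was left unmatched). Returns a list of
    booleans, True where the match should be removed.
    """
    flags = []
    for i, d in enumerate(disparities):
        if d is None:
            flags.append(False)
            continue
        # Compare with the median of nearby matched neighbours on the same edge.
        window = [x for x in disparities[max(i - 2, 0):i + 3] if x is not None]
        flags.append(abs(d - median(window)) > tolerance)
    return flags
```

A gradual change in disparity along the edge, as on a slanted surface, passes the test; a single horizontal "jag" of the kind visible in Figure 7 does not.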


Perspective  view  of  connected  edge  elements 
(the  z  axis  is  disparity,  not  depth) 

Figure  9 

Figures 11 through 14 show the various stages of the edge-based
stereo correspondence process when applied to the image pair of
Figure 10. This data (provided by the Night Vision Laboratory of
the U.S. Army), an aerial view of natural terrain, is considerably
noisier and significantly more detailed than the synthetic urban
imagery of Figure 1. It more clearly demonstrates the importance
of the interline connectivity constraint and, as shown in
Figure 6, the use of resolution reduction during the correspondence
process.


Intensity  Based  Matching 

The description so far has been of an edge-based stereo correspondence
scheme, one which uses a Viterbi optimality condition
and a cooperative continuity enforcement process in establishing
reliable matches between the intensity boundaries, or
edges, of a stereo pair of images. Yet there is much more information
one could provide about the depth in scenes such as
those depicted in Figures 1 and 10. The 3-D edge descriptions
of Figures 9 and 14 just highlight the structure of these scenes,
giving rather sparse disparity measures. One would like fuller
stereo detail from the matching, and a subsequent correspondence
process, this time based on image intensity values, supplies
this.

Figure 10


Edges  of  the  stereo  pair 
Figure 11

Preliminary correlation results
Figure 12

Final (post-connectivity constraint) edge-based results
3700 half-edge correlate pairs

Figure  13 


Perspective  view  of  connected  edge  elements 
Figure  14 


The above edge-based matching, in indicating corresponding
edges in the two images, provides strong local ‘vergence’ information
which greatly constrains the matching problem for any
remaining correspondence process. One can take unpaired edges
from an interval along one epipolar line and match them (again,
via Viterbi) with unpaired edges from the corresponding interval
of the conjugate line in the other image (the intervals are
bounded by either matched edge pairs or the periphery of the
image). This matching serves to "fill in the gaps" of the primary
edge-based correspondence. A final correspondence process (the
fourth!) takes intensity values from the intervals between paired
edges along conjugate lines and does yet one more Viterbi matching
on them. We are still developing the metrics used in this
correspondence process; at present we use 1) intensity
variance and 2) deviation from linearity of the interpolated surface.
Figure 15 indicates the results of this processing on two
sample lines (a line from the synthetic urban imagery above, a line
from the NVL imagery below). Arrowheads (→ and ←) show
half-edge pairings (to the left side and to the
right side, respectively), and the plotted contours show the disparity
values assigned by the intensity correspondence process
(depth can easily be determined from disparity once the camera
parameters are known).
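
For reference, the standard relation between disparity and depth can be written in a few lines. The sketch assumes the simplest camera arrangement, two identical cameras with parallel optical axes; the paper does not state its camera geometry, and the parameter names are illustrative.

```python
def depth_from_disparity(disparity, focal_length, baseline):
    """Depth along the optical axis for parallel cameras: Z = f * B / d.

    disparity and focal_length are in pixels; the result is in the
    units of baseline.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length * baseline / disparity
```

With a focal length of 800 pixels and a baseline of 0.5 units, a disparity of 8 pixels corresponds to a depth of 50 units; halving the disparity doubles the depth.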

Figures 16 and 18 show stereo perspective views of the full correspondence
results for the two sets of imagery (disparities are
those of the left camera image), and Figures 17 and 19 show
single views of the same surfaces at full resolution. These figures
show that the correspondence algorithm produces a full image
array disparity map of the viewed scenes. We have not yet
measured its accuracy.

Figure 15

Performance 

The correlation algorithm described here provides this three-dimensional
sensing in a fast, robust, and parallel-implementable
way.

[fast] The analyses as shown in Figures 9 and 14 took
roughly 25 and 35 seconds, respectively, from image input
to the final edge correspondence results as depicted.
The remaining edge and intensity correspondences
of Figures 17 and 19 required a further 45 and 15
seconds, respectively (on a KL-10).

[robust] The use of line-by-line matching, each processed
independently (accumulating a good deal of redundant
evidence), and the use of a coarse-to-fine strategy
(where the more reliable lower frequency components
are correlated first) have been seen to give a good
basis for obtaining the correct global consensus in the
subsequent cooperative process.

[parallel implementable] Since there is no interline dependence
during the various correspondence processes,
and the subsequent cooperative process has only pairwise
interline interactions, there is a high potential
for a parallel (n processors for n lines) realization.

Ahead 

There is still, of course, considerable research to be done within
this depth determination process. We will be:

•  looking into improvement issues in the present
algorithm, such as having the matcher use colour
information and using stronger global information
to both aid stereopsis and eliminate false
correspondences, and

•  testing it on further stereo imagery, both aerial
and near range, using digital terrain models for
accuracy tests where they are available.


Further afield, we are interested in developing such a correspondence
scheme into a continuous, multi-image correlator, capable
of integrating analyses over a series of passively sensed stereo
views in building a highly accurate and detailed map of scene
depth.

Our group’s plans in stereo research do not end with 3-D sensing.
Stereo for us is a tool for obtaining better descriptions of the
environment. These descriptions are to be used in interpreting
scenes, for our goal is image understanding. ACRONYM [Brooks
1981b] is a very successful geometric modelling and reasoning
system developed here at Stanford over the past five years. It
has the capability of manipulating three-dimensional models of
structures that it has been taught and interpreting presented
imagery in the context of these models. At present it is handicapped
with strictly two-dimensional image data. To remove
this restriction, the primary thrust of our stereo research over
the next few years will lie in developing a rule-based stereo mapping
system to work within the ACRONYM system. A stereo
mapping system as described here will be one of the components
of this rule-based system.

ABSTRACT

The  role  of  segmentation  in  the  ACRONYM  vision  system  is 
to  detect,  group,  and  represent  low-level  image  structures  such 
as  edges  and  curves  without  detailed  knowledge  of  objects  in 
the  scene  or  of  viewpoint.  These  structures  are  then  passed 
to  higher  levels  of  processing  which  do  use  such  knowledge  to 
predict  the  appearance  of  objects  in  images  and  then  to  match 
the  predictions  with  the  observed  structures.  Currently  this 
matching  takes  place  at  the  level  of  ribbons,  the  perspective 
projections of the generalized cylinders from which ACRONYM's
models  are  built.  The  three  steps  in  finding  ribbons  are  the 
detection  and  linking  of  edges,  the  segmentation  of  linked  edges 
into  curves,  and  the  grouping  of  curves  into  ribbons;  of  these,  the 
first  two  have  recently  been  significantly  improved,  and  revision 
of  ribbon  grouping  operations  is  under  way.  Formerly,  linked 
edges  were  segmented  only  into  straight  lines,  so  that  providing 
curved  data  for  grouping  into  ribbons  should  lead  to  superior 
performance. 

1:  INTRODUCTION 

A limiting factor in the performance of model-based vision systems
has long been the quality of the low-level features extracted
from images. Without good low-level data, a system searching
for occurrences of models in images is unable to take advantage
of the subtle cues to three-dimensional interpretation on which
human perception seems to depend so heavily; for a discussion of
such cues see [Binford 1981]. The goal of the work described here
is the accurate detection of the image features which underlie
some of these cues: edges, smooth curves, corners in curves, and
groups of related curves. Without accurate knowledge of the
characteristics of and interrelationships among such features,
inferences from cues expressed in terms of these features cannot
be drawn with high reliability.

2: WHAT IS SEGMENTATION?

In  what  follows,  the  term  segmentation  refers  to  the  process  of 
building  a  data-driven  symbolic  description  of  an  image  without 
using  detailed  knowledge  of  specific  objects  or  of  viewpoint.  In 
some  cases,  such  information  is  available  and  should  be  used. 
In its absence, segmentation relies on generic knowledge and
careful analysis of the signal. First, significant low-level structures
in the image are detected, such as edges, curves, and
‘well-formed’ regions. Next these structures are grouped according
to very general principles applicable in a broad range of
circumstances; an abbreviated list includes proximity, collinearity,
parallelism, and connectivity. Finally, suitable representations
for these grouped structures are computed and passed to higher
levels of processing. In model-based vision, the goal is often
to find instances of previously stored models in an image. To
this end, higher levels use knowledge of specific objects to match
the predicted appearance of objects in images with the grouped
structures discovered during segmentation.

3: THE ROLE OF SEGMENTATION IN ACRONYM

In  ACRONYM,  the  matching  between  grouped  structures  dis¬ 
covered  during  segmentation  and  the  appearance  of  models 
predicted by higher levels of processing takes place at the ribbon
level. A brief digression into ACRONYM's modeling system
is  necessary  here.  A  ribbon  is  the  perspective  projection  of  a 
generalized  cylinder;  all  models  in  ACRONYM  are  built  from 
generalized  cylinders,  volumetric  primitives  first  introduced  in 
[Binford  1971].  A  generalized  cylinder  is  the  volume  swept  out 
by  a  simple,  closed  plane  curve  as  it  moves  along  a  space  curve; 
the  plane  curve  may  be  scaled  as  it  moves. 

In  a  somewhat  oversimplified  version  of  the  prediction  process, 
the starting point is a model consisting of a group of generalized
cylinders,  so  the  projection  of  the  model  is  equivalent  to  that 
of  the  group  of  generalized  cylinders.  But  since  a  ribbon  is  the 
perspective  projection  of  a  generalized  cylinder,  the  perspective 
projection  of  the  model  is  merely  a  group  of  ribbons.  Thus  higher 
levels  of  ACRONYM  predict  structures  to  be  sought  in  the  image 
in  terms  of  ribbons.  More  precisely,  higher  levels  of  processing 
predict  the  appearance  of  a  model  by  specifying  the  ribbons  likely 
to  be  observed  if  the  predicted  object  is  present  in  the  scene.  For 
more details of prediction in ACRONYM, see [Brooks 1981].

The role of segmentation is to detect all the ribbons in the image
and to pass them to this higher level of processing. The process
by which ribbons are found has recently been improved and now
consists of the following steps. First, points in the image which
lie on edges are detected and linked into extended edges. Here
an edge is defined as a discontinuity in image intensity. Next,
corners in the extended edges are found and smooth curves are
fitted to the extended edges between them. Finally, the curves
are grouped into ribbons.

These operations are being implemented in ACRONYM in four
processing stages. Initially, the image is laterally inhibited by
convolving it with a mask designed to remove the effects of
smooth shading, which can pose problems for edge detectors.
In the subsequent edge detection and linking, points on edges
are located to subpixel accuracy and linked into extended edges
based on local information, so that both detection and linking are
completed in the same pass over the image. Next, during curve
segmentation, each extended edge is processed to find corners in
the edge, to smooth the edge between the corners, and then to
approximate the edge with a smooth curve. In the final, ribbon
finding stage, curves are grouped into ribbons by such operations
as jumping gaps in curves and matching extended smooth curves
opposite one another to form two sides of a ribbon.

Currently, the first three of these stages are implemented in
ACRONYM, leaving only the ribbon finder to be completed in
improved form. Figures 1-4 contain examples of these processing
steps for several typical images. Figure 1 is a bin of connecting
rods, courtesy of GM; figure 2 is an aerial photograph of an
airplane sitting on the runway at San Francisco International
Airport; figure 3 is a PC board, courtesy of SRI; and figure 4 is
an aerial photograph of a water treatment plant in Fort Belvoir,
Virginia.

The previous segmentation algorithm consisted of the Nevatia-Babu
line finder [Nevatia 1978] and a ribbon finder. This line
finder incorporates edge detection, linking, and fitting of straight
lines to linked edges; the ribbon finder then groups these straight
lines into ribbons. Because the new segmentation algorithm will
find more edges, will find them more accurately, will find fewer
edges caused by smooth shading, and will segment edges into
curved elements, the completion of the new ribbon finder to
group those curved elements into ribbons should significantly
improve ACRONYM's performance.

4: DETAILS OF THE IMPLEMENTATION

As noted above, the four steps in ACRONYM's segmentation
algorithm are lateral inhibition, edge detection and linking, curve
segmentation, and ribbon finding. What follows describes each
of these steps in some detail.

4.1 Lateral inhibition

An  edge  is  a  discontinuity  in  the  intensity  of  the  continuous, 
unquantized  image  which  must  be  sampled  and  quantized  (i.e. 
digitized) before it can be conveniently manipulated by a digital
computer. Unfortunately, sampling and quantization make
impossible  a  generally  meaningful  definition  of  discontinuity. 


A bin of connecting rods. From the top: original image, laterally
inhibited image, extended edges, segmented curves.

Figure 1

A printed circuit board. From the top: original image, laterally
inhibited image, extended edges, segmented curves.


An airplane on a runway. From the top: original image, laterally
inhibited image, extended edges, segmented curves.

Instead, discontinuities in the continuous, unquantized image
often appear merely as gradients in the digital image. The fact
that the converse is not true is a major obstacle to accurate
edge detection. That is, the problem is that not every gradient,
not even every large gradient, corresponds to a discontinuity in
the original, continuous, unquantized image. Thus there needs
to be some way of distinguishing gradients which correspond to
discontinuities from those which do not.


Laterally inhibiting the image before detecting edges helps avoid
a common pitfall in distinguishing between these two types
of gradients; it is discussed below. The operation itself is
simply a convolution of the image with a rotationally symmetric
mask with positive center, negative surround, and zero mean.
Whenever the mask ‘sits’ on a portion of the image where the
intensities increase linearly in any direction with any magnitude,
the result is zero. The current implementation of ACRONYM
uses a ‘difference of boxes’ mask, where the entire mask and its
central region are square, as in figure 5. The rotational symmetry
is quite rough, but keeping track of column sums during the
convolution enables the computations to proceed very quickly.
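
The behaviour of such a mask is easy to verify numerically. The following sketch assumes odd mask sizes (a 5 x 5 mask with a 1 x 1 centre by default, chosen arbitrarily since the sizes used in ACRONYM are not given) and computes the box sums by brute force, without the column-sum bookkeeping mentioned above.

```python
def difference_of_boxes(image, outer=5, inner=1):
    """Laterally inhibit `image` (a list of rows) with a zero-mean square mask.

    The response is (mean of centre box) - (mean of whole mask), scaled by
    outer**2 * inner**2 so that integer images give exact integer results.
    Border pixels where the mask does not fit are left at 0.
    """
    rows, cols = len(image), len(image[0])
    ro, ri = outer // 2, inner // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(ro, rows - ro):
        for c in range(ro, cols - ro):
            outer_sum = sum(image[y][x]
                            for y in range(r - ro, r + ro + 1)
                            for x in range(c - ro, c + ro + 1))
            inner_sum = sum(image[y][x]
                            for y in range(r - ri, r + ri + 1)
                            for x in range(c - ri, c + ri + 1))
            out[r][c] = outer * outer * inner_sum - inner * inner * outer_sum
    return out
```

On an image whose intensity is a linear function of position the output is zero everywhere, while across a step in intensity the output changes sign.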


‘Difference of boxes’ mask for lateral inhibition.
Figure  5 


When a perfect digital step edge is laterally inhibited, the transition
of the resulting signal from positive to negative, or vice
versa, marks the location of the edge. This transition is known
as a zero crossing and has been used by others in edge detection
[Horn 1972, Marr 1979a]. Once the image is laterally inhibited,
the problem of detecting edges thus becomes one of detecting
zero crossings.


A common pitfall in identifying the ‘right’ gradients is smooth
shading, defined here as a constant, possibly large, gradient over
a region in an image. A human observes no edges in such a
region, but an edge detector which treats all large gradients as
edges may find spurious edges throughout the smoothly shaded
area. Figure 6 is the output of the Nevatia-Babu line finder for
a wider view of the airplane in figure 2; shading causes the spurious
edges in the middle of the fuselage. In this situation, at least,
identifying the ‘right’ gradients on the basis of their magnitude
alone is too simple a strategy. Lateral inhibition avoids this
pitfall because it is equivalent to subtracting the linear portion of
the signal from itself. Thus laterally inhibiting smooth shading,
or a constant gradient, sends the signal to zero because it has
only a linear component. Since no zero crossings result, no
edges are detected, and the desired correspondence with human
performance is achieved.


A water treatment plant. From the top: original image, laterally
inhibited image, extended edges, segmented curves.

Figure 4


Result of the Nevatia-Babu line finder for a wider view of the
airplane in figure 2.

Figure  6 

The second row of figures 1, 2, 3, and 4 shows the results of
laterally inhibiting some typical images. The brighter regions
correspond to positive values, the darker to negative, and a
medium gray to zero; the zero crossings therefore appear as the
boundaries between light and dark regions.

4.2  Edge  detection  and  linking 

For the reasons just stated, zero crossings of the laterally inhibited
signal are used to localise points on edges. Once a zero
crossing  between  a  pair  of  horizontally  or  vertically  adjacent 
pixels  is  found,  the  subpixel  location  of  the  underlying  edge  is 
approximated  by  linear  interpolation  between  the  values  of  the 
laterally  inhibited  signals  at  the  two  pixels.  The  strength  is  ap¬ 
proximated  by  the  difference  between  the  values  and  used  later 
during  the  processing  to  remove  noise. 
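
The interpolation step amounts to finding where the straight line through two adjacent values reaches zero. A minimal sketch, with a hypothetical function name and the convention that the two pixels sit at x and x + 1:

```python
def zero_crossing(a, b, x=0.0):
    """Locate a zero crossing between samples a (at x) and b (at x + 1).

    Returns (position, strength), or None when there is no strict sign
    change between the two values.
    """
    if a * b >= 0:
        return None
    position = x + a / (a - b)   # where the line through a and b reaches zero
    strength = abs(a - b)        # difference of the two inhibited values
    return position, strength
```

For example, values of -1 and 3 at neighbouring pixels place the edge a quarter of a pixel from the first one, with strength 4.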

The zero crossings are linked into extended edges as they are
detected. Each junction of four pixels is examined, once the
zero crossings between each horizontally and vertically adjacent
pair of the four have been detected. If there are only two, they
are linked together; if there are four, they are linked in pairs
so that the direction of the transition from positive to negative
is constant along each linked edge. (Naturally, if only one zero
crossing is present, no linking is possible, and there cannot be
three transitions from positive to negative without a fourth.)

One traversal of the picture suffices for both detecting zero crossings
and linking them into extended edges. The algorithm that
makes this possible is roughly as follows. Consider visiting each
pixel by sweeping each row from left to right, starting at the
top row and working down, and suppose the next pixel to be
visited is in the middle of the picture somewhere. A record has
been kept of the extended edges linked to zero crossings between
horizontally adjacent pixels in the row above, and of the extended
edge linked to that between the vertically adjacent pair of pixels
in the column to the left, of the same row and the row above.
Now if there is a zero crossing between the current pixel and its
neighbor on the left or that above, the existing extended edges
to which it can be linked are available from these records, and
the newly-detected zero crossing can easily be added to the appropriate
extended edge. The records are updated if necessary
so that yet-to-be-detected zero crossings can be added to the
newly-extended edges when appropriate, and the detecting and
linking continues. Thus keeping track of at most one row’s plus
one cell’s worth of extended edges at a time enables the detection
of zero crossings and their linkage into extended edges to be
completed in one pass.
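
The junction rule can be sketched in a simplified form. This is an illustration and not the single-pass scheme described above: it first collects all zero crossings and then groups them with a union-find structure, and it links only the unambiguous junctions that have exactly two crossings, leaving out the four-crossing case. The keys used to name crossings are an assumed convention.

```python
def link_zero_crossings(signal):
    """Group zero crossings of a 2-D signed array into extended edges.

    A crossing is keyed ("h", r, c) between pixels (r, c) and (r, c + 1),
    or ("v", r, c) between pixels (r, c) and (r + 1, c).
    Returns a list of sets of crossing keys, one set per extended edge.
    """
    rows, cols = len(signal), len(signal[0])
    crossings = set()
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and signal[r][c] * signal[r][c + 1] < 0:
                crossings.add(("h", r, c))
            if r + 1 < rows and signal[r][c] * signal[r + 1][c] < 0:
                crossings.add(("v", r, c))

    parent = {k: k for k in crossings}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    # Visit each junction of four pixels whose top-left pixel is (r, c).
    for r in range(rows - 1):
        for c in range(cols - 1):
            around = [k for k in (("h", r, c), ("h", r + 1, c),
                                  ("v", r, c), ("v", r, c + 1))
                      if k in crossings]
            if len(around) == 2:  # unambiguous: link the pair
                parent[find(around[0])] = find(around[1])
            # Four crossings need the sign-direction rule; omitted here.

    edges = {}
    for k in crossings:
        edges.setdefault(find(k), set()).add(k)
    return list(edges.values())
```

For a vertical step edge every junction along the step has exactly two crossings, so all of them end up in one extended edge.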

The third row of figures 1, 2, 3, and 4 contains sample output from
the edge detection and linking phase for some typical images.

4.3  Curve  segmentation 

Each  extended  edge  is  segmented  by  examining  it  individually, 
locating corners, and fitting smooth curves to approximate the
extended  edge  between  the  corners.  The  goal  is  to  extract 
the  important  structural  features  of  each  extended  edge  which 
tend  to  be  obscured  by  sensor  noise,  imperfections  in  the  im¬ 
aging  system,  digitization,  and  artifacts  of  previous  stages  of 
processing.  For  example,  one  might  expect  edges  in  the  perfectly 
sensed,  continuous  image  underlying  the  noisy,  digital  image  to 
be  characterized  by  smoothly  varying  tangents  over  most  of  their 
length  punctuated  by  occasional  discontinuities  in  tangent.  The 
processing  of  extended  edges  seeks  to  reconstruct  these  two  im¬ 
portant  features  by  locating  the  tangent  discontinuities,  less  for¬ 
mally  called  corners,  and  by  estimating  the  smoothly  varying 
tangent. 

Since the extended edges are digital curves, there is once again
no generally useful definition of tangent discontinuity. Moreover,
corners in the extended edge are usually rounded or confounded
with small but sharp corners caused by noise. For these reasons,
the criterion for detecting corners is based on extrema in a digital
measure of the curvature of the extended edge. The extremum
occurs at the largest change in tangent in regions where all
changes in tangent may be large and thus handles blurred corners
correctly. A threshold on the measure of curvature is used so
that corners too small to be distinguished from random noise
are ignored. The corner-finding process for a single edge is
illustrated by the left half of figure 7.


Left, corners detected in an extended edge; right, smooth curves
fitted to the edge between corners.

Figure  7 
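
The corner criterion described above (a thresholded extremum of a
digital curvature measure) can be sketched as follows. The text does
not say which curvature measure is used, so the k-step turning angle
below, the step size k, and the 30-degree threshold are all
assumptions chosen for illustration.

```python
import math


def turning_angles(points, k=3):
    """Absolute change in direction (radians) at each point, taken over k steps.

    This k-step turning angle is an assumed stand-in for the paper's
    unspecified digital curvature measure.
    """
    angles = [0.0] * len(points)
    for i in range(k, len(points) - k):
        (x0, y0), (x1, y1), (x2, y2) = points[i - k], points[i], points[i + k]
        incoming = math.atan2(y1 - y0, x1 - x0)
        outgoing = math.atan2(y2 - y1, x2 - x1)
        # Wrap the difference into [-pi, pi) before taking its magnitude.
        diff = (outgoing - incoming + math.pi) % (2 * math.pi) - math.pi
        angles[i] = abs(diff)
    return angles


def find_corners(points, k=3, threshold=math.radians(30)):
    """Indices where the curvature measure is a local maximum above threshold."""
    curv = turning_angles(points, k)
    corners = []
    for i in range(1, len(points) - 1):
        is_peak = curv[i] > curv[i - 1] and curv[i] >= curv[i + 1]
        if is_peak and curv[i] >= threshold:
            corners.append(i)
    return corners
```

Because the measure spans k points on either side, a blurred corner
produces a run of large values, and only the peak of the run is
reported; small wiggles fall below the threshold and are ignored.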

Once the corners in an extended edge have been located, it is
assumed that the edge in the perfectly sensed, continuous image
underlying the extended edge has a smoothly varying tangent
between each pair of adjacent corners. Any small corners and
wiggles in the extended edge between corners are attributed
to noise and suppressed by the subsequent fitting of a smooth
curve. Thus the smooth curve represents an estimate of the
smoothly varying tangent of the underlying continuous edge.
This estimate is constructed by smoothing samples of the tangent
to the extended edge and fitting cubic splines to them. The right
half of figure 7 contains an example of this step for a single edge.

The current implementation of the curve segmenter incorporates
many of these ideas. The remainder will soon be included. A
current topic of research is the development of a spline based
on estimates of curvature to replace the cubic spline now used,
which is based on estimates of tangent.

The bottom row of figures 1, 2, 3, and 4 contains sample output
from the curve segmenter for some typical images.

4.4  Ribbon  finding 

The ribbon finder tries to group the curves produced by the curve
segmenter into ribbons. A variety of grouping operations are in
the process of being implemented. Here only a few of the more
important are mentioned, to sketch in broad outline the style of
computation involved in this stage of segmentation.

The linking of collinear lines is a simple but important operation.
The higher order analogue is the jumping of gaps in curves that
was mentioned earlier. Here, two separate curves that can be
interpreted as one long, smooth curve with a single break are
joined together. Recognising when this is appropriate involves
not just the positions and orientations of the two curves, but
their curvatures as well. The efficient storage and retrieval of
these data in a two-dimensional domain requires sophisticated
algorithms and data structures whose exploration is an active
topic of research. These operations are illustrated by figure 8.


Left, linking collinear lines; right, jumping gaps in curves.

Figure  8 

A somewhat more complex operation is the matching of two
extended, opposing curves to form the longer opposing boundaries
of a ribbon. Figure 9 provides an example. One way to do
this is point-to-point matching from one curve to another using
only information local to the points being matched. Algorithms
which perform a more global curve-to-curve correspondence
based on representing curves in terms of their curvatures at
several different resolutions are being developed.


Two  extended  curves  with  opposing  boundaries  grouped  to  form 
a  ribbon;  the  dotted  line  indicates  the  ribbon’s  axis. 

Figure  9 

Again, the ribbon finder that uses curved data from the curve
segmenter is not yet fully implemented, so no results are
available. For examples of the performance of the old ribbon finder,
which grouped only straight lines into ribbons, see [Brooks
1979].

A general-purpose scene-analysis system is
described which uses constraint-filtering
techniques to apply domain knowledge in the
interpretation of the regions extracted from a segmented
image. An example is given of the configuration
of the system for a particular domain, FLIR
(Forward Looking InfraRed) images, as well as
results of the system's performance on some
typical images from this domain.

1. Introduction

An image (whether on the human retina, on
photographic film, or in some electronic device)
is formed by a complicated interaction of light
with objects in three-dimensional space (a scene).
Scene analysis is the process of unravelling this
interaction: inferring from an image the
arrangement of lighting and objects that produced it. In
theory, this problem is indeterminate: A given
image may result from many different scenes, all
of which happen to appear identical from the
observer's viewpoint. But in practice there are
usually sufficient restrictions on allowable
scenes to permit essentially only one
interpretation of the image. The problem is to find this
interpretation efficiently. Humans are clearly
able to do this. Can computers achieve similar
performance?

In this paper we present a method for scene
analysis based on the application of
constraint-filtering techniques to a network of regions
extracted from an image. Such an approach has two
chief advantages. First, its conceptual
simplicity: It provides a clean separation between the
general processing algorithm and the knowledge
about a particular domain, which is expressed
declaratively as constraints. Second, its
computational speed: Constraint filtering can be
decomposed into many almost independent processes
which can all be run in parallel on a suitable
multiprocessor computer.

To try this approach, we have implemented a
prototype system that does scene analysis by
constraint filtering. A diagram giving an informal
overview of the system is shown in Figure 1. In
the interest of expediency we have made many
simplifications. For example, only a few crude
measurements are made on the extracted regions. In
the following sections, we will describe the
prototype system, taking note of these simplifications.
We will also show how the system is used in a
particular domain: forward-looking infrared (FLIR)
images of battlefield scenes. This domain was
chosen partly because of its military interest, but
primarily because its moderate complexity is just
about right for fully exercising the prototype
system. Then we will discuss the system's
performance, taking care to distinguish those failures
that are inherent in the method from those that are
merely the result of simplifications made in this
implementation, and finally, we will suggest
directions for further progress.

2.  Segmentation 

A digital image is merely an array of light
intensity (or color) values. There seems to be no
way of going directly from these values to a
description of a scene in terms of the objects in
it. As argued by Barrow and Tenenbaum [1,2],
Marr [3], and numerous others, several stages of
processing are needed, each with its own
intermediate representations of the information contained
in the image. A first step is to organize the
pixels into groupings that correspond more closely
to the objects in the scene.

Typically, this is done by segmenting the image
into regions of fairly homogeneous brightness. For
many scenes, this is a reasonable thing to do. In
most cases, the regions will correspond to the
objects themselves, or else to significant pieces of
them. By this means the myriads of pixels in an
image can be reduced to a few score or a few
hundred regions, considerably decreasing the amount
of data that must be processed, but with little loss
of information. Furthermore, since regions more
closely correspond to objects, expectations about
the appearance of objects can be more readily
applied to the regions than to unorganized pixels.

However, segmentation into regions is not
without problems. An initial process of
segmentation normally applies a single criterion of
homogeneity over the entire image. Unfortunately, a
difference in brightness that is insignificant in
some contexts (such as a fluctuation in a textured
background) may well be very significant in other
contexts (such as part of the border of an object
with its surroundings). Most segmentation
processes take little account of this sort of
contextual information, and so make errors of two
sorts: oversegmentation and undersegmentation.
Oversegmentation breaks into pieces what should
ideally be a single region. Properties of the
region as a whole (such as shape and area) and
relations with other regions (such as adjacency
and surroundedness) cannot be computed directly,
but can only be recovered by attempting to merge
pieces that are likely to belong to the same
region. More serious is undersegmentation, by
which several regions that should be distinct are
fused together. Again, region properties and
relations are lost, but recovering them is a more
difficult business of attempting to split the
fused region into parts.

Several attempts have been made to overcome
this problem. Tenenbaum and Barrow in IGS
(Interpretation Guided Segmentation) [4] used
domain knowledge to guide the low level segmentation.
Constraints about the relationships between
objects were used to guide the merging of pixels into
regions. Feldman and Yakimovsky [5] also used
semantic constraints to guide segmentation.
Another approach, used by Nagao and Matsuyama [6],
first performs an unguided segmentation and later
corrects the errors in this segmentation by a
semantically controlled process of merging and
splitting regions. We assume that
undersegmentation never occurs, and that oversegmentation is not
serious: that an object is at worst broken into
two or three pieces. We augment our domain model
to cover fragments of objects, but without making
any attempt to integrate them into wholes. For the
simple domain used as an example, the initial
segmentation can usually be fine-tuned by hand to fit
our assumptions above. Even so, failures are not
uncommon, indicating that a more subtle treatment
of segmentation errors is needed.

First we smooth the image using an
edge-preserving smoothing technique in order to reduce
noise. The particular technique used does not
matter greatly, but usually we have used Narayanan
and Rosenfeld's histogram-guided smoothing
technique [7], which has proved quite effective. Next,
we requantize the image into a small number of
gray levels (typically five), following the peak
structure of the histogram of the smoothed image.

After this, the regions themselves can be
extracted by a connected components analysis. At the
same time, we make a few measurements on each
region; these measurements serve as a description
of the region for all subsequent processing. We
construct the bounding upright rectangle around
each region (see Figure 2) and measure the image
location of its lower left corner, its width and
height, as well as the area and average brightness
of the region itself.
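
The extraction and measurement step can be sketched roughly as
follows. This is an illustration only: 4-connectivity, the list-of-rows
image layout, the convention that row indices grow downward, and the
use of a separate `brightness` array for the averages are all
assumptions, not details given in the text.

```python
from collections import deque


def extract_regions(levels, brightness):
    """Connected components of equal gray level, with simple measurements.

    levels:     2-D list of requantized gray levels
    brightness: 2-D list of the brightness values to be averaged
    """
    rows, cols = len(levels), len(levels[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0]:
                continue
            level = levels[r0][c0]
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            pixels = []
            while queue:  # breadth-first flood fill over 4-neighbours
                r, c = queue.popleft()
                pixels.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and levels[nr][nc] == level):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            rs = [p[0] for p in pixels]
            cs = [p[1] for p in pixels]
            regions.append({
                "lower_left": (min(cs), max(rs)),  # (column, row); rows grow downward
                "width": max(cs) - min(cs) + 1,
                "height": max(rs) - min(rs) + 1,
                "area": len(pixels),
                "brightness": sum(brightness[r][c] for r, c in pixels) / len(pixels),
            })
    return regions
```

Each dictionary is the whole description of a region that later
stages see, which is why the text calls the description crude.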

As mentioned above, these measurements provide
only a crude description of each region, but
sufficient for this prototype system. A full
implementation would need a more complete description of
shape, perhaps the chain code of the boundary of
each region. Since any region description is
necessarily incomplete, it may ultimately be
necessary to refer back to the original image to check
for properties that cannot easily or efficiently be
extracted by preprocessing operations.

3.  Constraint  filtering 

After segmentation, scene analysis becomes
mostly a matter of labelling the regions with their
identifications as objects or object parts. (For
now we ignore the problem of organizing the parts
of objects into wholes.) Clearly, only those
labellings are valid that can be derived from an
arrangement of real objects in space. Properties
of objects, and relationships between them, imply
corresponding properties and relationships of the
image regions that result from these objects.
These projected properties and relationships
constrain the possible labelling of regions with
object identifications. Thus scene analysis can
be reduced to a constraint satisfaction problem.
(The early work of Barrow et al. [8,9] used this
approach, with the constraints derived from a
relational structure which provided a single but
inflexible scene model.)

The traditional technique for solving such
problems is backtracking. However, backtracking is
inherently a sequential technique, which does not
lend itself well to parallel processing. Even if
we restrict our attention to sequential processing,
simple backtracking has a serious defect,
especially on the problems arising in scene analysis:
It suffers greatly from "thrashing" behavior
[10,11,12]. When a failure is discovered, only the most
recent labelling is reconsidered. If the true
cause of the failure lies in an earlier labelling,
it will take the program many steps of blind
backtracking before it can undo the incorrect labelling.
To overcome these problems, a number of authors
[10,11,14,15,16,17,18,21] have proposed "constraint
filtering", "relational consistency", or "discrete
relaxation" techniques for constraint satisfaction
problems. Some have emphasized the suitability of
these methods for parallel processing, while others
have stressed the avoidance of thrashing. We feel
that the chief advantage of these methods lies in
their potential parallelism, especially since
Gaschnig [13] has shown that more sophisticated
backtracking methods can outperform sequential
implementations of constraint filtering.

In order to perform constraint filtering, it
is necessary that those nodes (regions) that
constrain each other be connected in a network. It is
at least theoretically possible for the labelling
of a region to be influenced by any other region in
the image, so ideally the constraint network should
be a complete graph, connecting each region to every
other region. But in practice this is not
feasible, since the number of interconnections grows as
the square of the number of regions, which is far too fast.
Having too many interconnections is undesirable for
two reasons: First, it increases the cost of
building a hardware network for constraint filtering
(though some architectures, such as that of ZMOB
[19], permit arbitrary interconnection at no extra
hardware cost). Second, and more important, it
increases processing time, since the amount of
computation done for each region is roughly
proportional to the number of regions it is connected
to.

Therefore, it is desirable to limit the
number of interconnections. This must be done
carefully, since the correctness and effectiveness of
the constraint filtering depend on the
completeness of the interconnections, and the lack of a
necessary connection might prevent or mislead the
application of an important constraint. We would
like to build a network of interconnections that
is as sparse as possible, but still produces the
same results as the complete graph. Obviously,
this cannot be known ahead of time; the best we
can do is to connect those regions that have a
good chance of being relevant to each other.
This is a matter that requires much further
investigation, but for now we have implemented a simple
notion of relevance: A region is connected to all
regions that are very close to it (because these
make up its immediate context), and to all very
large regions in the image (because these give a
good basis for judging it in its global context).
Thus the number of interconnections is roughly
constant for each node and overall is proportional
to the number of regions in the image. This
interconnection scheme is imperfect, but appears to work
with few errors, at least for the domain of FLIR
images used in this report.
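
The relevance rule just described might be sketched as below. The
text gives no numerical definition of "very close" or "very large",
so the centre-to-centre distance test, the `center` and `area` keys,
and both thresholds are hypothetical.

```python
import math


def build_network(regions, near_dist, large_area):
    """Return the set of undirected arcs (i, j), i < j, between relevant regions.

    Two regions are joined if their centres lie within near_dist of each
    other, or if either of them has area >= large_area.
    """
    arcs = set()
    for i, a in enumerate(regions):
        for j in range(i + 1, len(regions)):
            b = regions[j]
            close = math.dist(a["center"], b["center"]) <= near_dist
            large = a["area"] >= large_area or b["area"] >= large_area
            if close or large:
                arcs.add((i, j))
    return arcs
```

This sketch still examines every pair of regions while building the
network; the saving is in the filtering that follows, which only
visits the arcs that were actually created.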

Once the configuration of the network is
complete, the constraint filtering proper can begin*.
We first of all attach to each region a list
containing all the labellings that it might possibly
bear. (Currently, these will be all labellings
possible in the domain, although it should be
possible to use context and the taxonomy of labels
to reduce this initial list considerably.) Each
label has associated with it a special
"when-proposed" procedure, which is executed for each
region whenever that label is first proposed for
the region. This permits the calculation of
certain parameters that make sense only if the region
is interpreted as a particular sort of object. For
example, if a region is hypothesized to correspond
to an object of a certain intrinsic size, then it
may be useful to use the region's apparent size, in
conjunction with the camera geometry, to compute
the object's range and location in space. Notice,
however, that this computation makes sense only
under this hypothesis.
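
As an illustration of a when-proposed procedure, the range
computation mentioned above can be sketched under a simple pinhole
camera model, where range = f * H / h for focal length f (in pixels),
intrinsic object height H, and apparent height h (in pixels). The
pinhole model, the label names, and the intrinsic heights are
assumptions made for this example; the text does not specify the
camera geometry.

```python
# Hypothetical intrinsic heights (metres) for labels of known size.
INTRINSIC_HEIGHT_M = {"tank": 2.5, "truck": 3.0}


def range_from_size(object_height_m, region_height_px, focal_length_px):
    """Pinhole-camera range estimate: range = f * H / h."""
    return focal_length_px * object_height_m / region_height_px


def when_proposed(label, region, focal_length_px):
    """Compute parameters that are meaningful only under the given label."""
    params = {}
    if label in INTRINSIC_HEIGHT_M:
        params["range_m"] = range_from_size(
            INTRINSIC_HEIGHT_M[label], region["height"], focal_length_px
        )
    return params
```

The same region yields a different range under each hypothesized
label, which is why the result is attached to the (region, label)
pair rather than to the region alone.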

Next, the label lists are filtered using what
are called here "unary constraints". That is,
knowledge about the intrinsic properties of objects
is used to eliminate incorrect labellings from each
region's label list. Regarded as propositions,
the unary constraints have the form: "If a region
is to bear this label, then the region must have
these properties". The constraint is actually used
in the contrapositive form: "If a region does not
have these properties, then it cannot bear this
label." These properties may be immediate
properties of the region, or they may be those computed
indirectly by the appropriate when-proposed
procedure.
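
A minimal sketch of unary filtering in its contrapositive form
follows. The labels, the property tests, and the numeric limits are
invented for illustration and are not the constraints of the actual
FLIR model.

```python
def filter_unary(region, labels, unary_constraints):
    """Return the labels that survive every unary constraint attached to them."""
    surviving = []
    for label in labels:
        required = unary_constraints.get(label, [])
        # Contrapositive use: one missing property refutes the label.
        if all(holds(region) for holds in required):
            surviving.append(label)
    return surviving


# Hypothetical constraints: each label maps to the properties it requires.
UNARY = {
    "sky": [lambda r: r["brightness"] < 50, lambda r: r["area"] > 1000],
    "tank": [lambda r: r["brightness"] > 150, lambda r: r["area"] < 500],
}
```

A label with no entry in the table has no required properties and
therefore always survives this stage.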

Clearly, these three steps (hypothesizing
labels, computing parameters, and filtering labels)
could be done at one swoop, with some improvement
in computational efficiency. Here they are done
separately, for clarity of presentation and ease of
programming.

After the region labels have been filtered, we
can attach to each interconnection (or arc) a list
of label pairs that is the cross-product of the
sets of labels on the two regions at either end of
the arc. This list represents the joint labellings
that are simultaneously possible for the two regions
considered. Then all these label-pair lists can be
filtered by binary constraints, that is, those joint
labellings can be eliminated that violate a
constraint on the labelling of pairs of regions. These
constraints have the propositional form: "If two
regions (say $r_1$ and $r_2$) are to simultaneously bear
the labels $\ell_1$ and $\ell_2$ respectively, then $r_1$ and $r_2$
must stand in certain relations to each other."
Again the constraint is used in its contrapositive
form: If the two regions fail to stand in the
required relations to each other, the appropriate
pair of labels can be deleted from the arc joining
them.
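
The generation and filtering of label pairs on one arc might look
like the sketch below. The representation of an arc's labelling as a
set of ordered pairs, and the table of relation tests keyed by label
pair, are assumptions made for illustration.

```python
from itertools import product


def label_pairs(labels_a, labels_b):
    """Cross-product of the label sets at the two ends of an arc."""
    return set(product(labels_a, labels_b))


def filter_binary(region_a, region_b, pairs, binary_constraints):
    """Delete label pairs whose required relation does not hold.

    binary_constraints maps a label pair to a test on the two regions;
    pairs without an entry are unconstrained and are kept.
    """
    kept = set()
    for label_a, label_b in pairs:
        relation = binary_constraints.get((label_a, label_b))
        if relation is None or relation(region_a, region_b):
            kept.add((label_a, label_b))
    return kept
```

For instance, a hypothetical constraint keyed by ("sky", "ground")
could require that the first region lie above the second in the image.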

Following all this, three more filtering
processes can be applied. One of them, filtering by
existential constraints, enforces constraints of
the following form: "If a region is to bear a
certain label then there must exist other regions
that have certain properties and stand in certain
relationships with the given region." This is
very much like a unary constraint, except that the
properties of the other regions include the
requirement that they bear certain labels, and that those
labels are permitted simultaneously with the
labelling to which the existential constraint is
being applied. Thus existential constraints must
examine the arc labellings. Unary constraints
need be applied just once, but since the allowable
labellings of arcs change during the constraint
processing, an existential constraint that is
satisfied early may later be violated because a
labelling that it depended on has been rejected.
Hence the filtering by existential constraints
should be redone every time the arc labellings
change.
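
An existential filter can be sketched as follows, assuming node
labellings are stored as sets, arc labellings as sets of ordered
pairs keyed by (i, j), and each existential constraint as a pair
(required label of the other region, relation test). All of these
representations are assumptions made for the example.

```python
def pair_allowed(arc_labels, n, label_n, m, label_m):
    """True if the arc between n and m still permits (label_n, label_m)."""
    if (n, m) in arc_labels:
        return (label_n, label_m) in arc_labels[(n, m)]
    if (m, n) in arc_labels:
        return (label_m, label_n) in arc_labels[(m, n)]
    return False  # no arc: the two regions are not connected


def filter_existential(node_labels, arc_labels, regions, existential):
    """Delete labels whose existential constraint has no witness region.

    existential maps a label to (other_label, relation). Returns True if
    any label was deleted.
    """
    changed = False
    for n, labels in node_labels.items():
        for label in list(labels):
            if label not in existential:
                continue
            other_label, relation = existential[label]
            witnessed = any(
                m != n
                and pair_allowed(arc_labels, n, label, m, other_label)
                and relation(regions[n], regions[m])
                for m in node_labels
            )
            if not witnessed:
                labels.discard(label)
                changed = True
    return changed
```

Because the test consults the arc labellings, a label that passes now
can fail on a later call once the supporting pair has been deleted,
which is why the filter has to be rerun.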

The other two filtering processes,
arc-upon-node interaction and node-upon-arc interaction,
attempt to enforce consistency between the node
labellings and arc labellings. The first process
ensures that every node labelling has support from
every arc that impinges on it. By "support" we
mean that there exists on each arc at least one
label pair that has the same label as the region
for its first or second component, as appropriate,
depending on which end of the arc the node lies at.
If a node labelling lacks support it is deleted.
The second process ensures that every arc labelling
has support from the nodes at either end of it.
Here, by "support" we mean that every label pair
on an arc should have as its first element a label
that is represented in the labelling of the node at
the appropriate end of the arc, and have as its
second element a label that is represented in the
labelling of the node at the other end of the arc.
Any arc label that lacks such support is deleted.

These three interdependent filtering
processes provide a simple but effective way of
propagating inferences about the identification of
regions through the constraint network. They
provide the system with a rudimentary form of
reasoning about scenes in the sense that its
conclusions, if justified logically, would take
several proof steps from the given axioms (these are
the region properties and relations, and the
domain constraints). Of course, all this reasoning
is done by a mechanical process of propagating the
effects of deleting node and arc labellings, but it
can be regarded as a limited form of logical
deduction.

Notice that all the filtering processes work
strictly by refuting and eliminating labellings.
This means that, after the initial labelling
generation processing, all the filtering processes
could be run independently and asynchronously on
the regions in any order, without the fear of race
conditions occurring. That is, the results of the
constraint filtering will be the same, no matter in
what order the individual filtering processes are
applied to each node, provided all processes are
applied until the network stabilizes, that is, until no
further deletion of labellings can be made. However,
in the interests of efficiency and simplicity and
in order to simulate an actual parallel
implementation, we apply the various processes
synchronously in parallel over the entire network. As
described above, we first perform all the initial
node labellings, next all the node filtering by
unary constraints, then all the generation of joint
labellings on arcs, followed by arc filtering by
binary constraints. Now, the arc-upon-node
interaction and the existential constraint filtering use
the arc labellings to update the node labellings;
and the node-upon-arc interaction updates the arc
labellings using the node labellings. Therefore it
is appropriate to apply these three propagation
processes in a cycle of three (in the order given)
repeatedly until the network stabilizes (when no
further deletions of labellings can be made).
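
The propagation cycle can be sketched as below. This is a sequential
illustration, not the synchronous parallel scheme of the prototype;
the set-based data layout and the optional `existential_filter`
callback (any function that deletes node labels and returns True when
it did so) are assumptions.

```python
def arc_upon_node(node_labels, arc_labels):
    """Delete node labels that lack support on some impinging arc."""
    changed = False
    for (a, b), pairs in arc_labels.items():
        firsts = {la for la, _ in pairs}
        seconds = {lb for _, lb in pairs}
        for node, supported in ((a, firsts), (b, seconds)):
            kept = node_labels[node] & supported
            if kept != node_labels[node]:
                node_labels[node] = kept
                changed = True
    return changed


def node_upon_arc(node_labels, arc_labels):
    """Delete label pairs whose components are no longer on the end nodes."""
    changed = False
    for (a, b), pairs in arc_labels.items():
        kept = {(la, lb) for la, lb in pairs
                if la in node_labels[a] and lb in node_labels[b]}
        if kept != pairs:
            arc_labels[(a, b)] = kept
            changed = True
    return changed


def propagate(node_labels, arc_labels, existential_filter=None):
    """Cycle the three processes, in the order given, until nothing is deleted."""
    cycles = 0
    while True:
        changed = arc_upon_node(node_labels, arc_labels)
        if existential_filter is not None:
            changed = existential_filter(node_labels, arc_labels) or changed
        changed = node_upon_arc(node_labels, arc_labels) or changed
        cycles += 1
        if not changed:
            return cycles
```

Since every step only deletes labellings, the loop must terminate,
and the final labellings do not depend on the order in which arcs are
visited.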

After the constraint filtering has stabilized
and terminated, we can turn our attention to the
interpretation of its results. Unfortunately,
these results will not necessarily be correct in
the sense of being a valid solution to the given
constraint satisfaction problem. Ideally, we
would like to see every region correctly and
uniquely labelled with its identification as an
object or object part. However, given the way we
have decomposed the problem so as to make it
amenable to parallel processing, such an outcome
cannot be guaranteed. Before discussing these
erroneous results in detail, we should stress that
both in the example domain used here, and in other
domains [17,20], we have not found these errors to
be a serious problem in practice. Other authors
[21] report similar findings.

One sort of error is that after the filtering
a region may retain several labels, not just one.
This situation can arise from two causes. First,
it may well be that there is more than one valid
solution to the constraint satisfaction problem,
that is, the image admits of several distinct
interpretations. So a given region may have more than
one correct identification ascribed to it, either
of itself, or in conjunction with multiple
identifications of neighboring regions (neighboring in
the sense of being directly connected in the
network). From one point of view, this can hardly be
considered an error: a region can have several
different interpretations, and all of these are
retained. But if a number of regions all have
multiple labels, it may be of interest to
discover which unique labellings of all of them are
simultaneously possible, and this sort of
unravelling cannot be done by mere constraint filtering.
Related to this is the second cause of multiple
labelling: There may exist in the network an
ambiguity that can be resolved in principle, but
cannot be resolved by pairwise constraint
filtering, because its resolution requires the simultaneous
examination of the labellings of three or more nodes.
Errors such as these are not serious, since they
tend to occur infrequently: for most domains it
seems that pairwise interaction is sufficient for
essentially unambiguous interpretation. Even when
they occur they can easily be resolved, by some
sort of backtracking technique alone, or in
combination with further constraint filtering, as used
by Barrow and Tenenbaum in MSYS [21], and by
Haralick and Shapiro [15,16]. In most cases, the
bulk of the disambiguation will have been done by
the constraint filtering, leaving very little work
to be done by the final backtracking. However, we
have not implemented such a post-processing phase
for the current system because our main interest
is in the filtering itself.
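
The prototype has no such post-processing phase, so the following is
only a sketch of what a final backtracking pass over the filtered
network could look like. It enumerates the unique labellings that are
simultaneously permitted by all arc labellings; the data layout is
the same assumed set-based one, and the simple chronological search
is a choice made for brevity.

```python
def consistent_labellings(node_labels, arc_labels):
    """List every assignment of one label per node allowed by all arcs."""
    nodes = sorted(node_labels)

    def compatible(assignment, node, label):
        # Check the candidate against every already-labelled neighbour.
        for (a, b), pairs in arc_labels.items():
            if a == node and b in assignment and (label, assignment[b]) not in pairs:
                return False
            if b == node and a in assignment and (assignment[a], label) not in pairs:
                return False
        return True

    def extend(i, assignment):
        if i == len(nodes):
            yield dict(assignment)
            return
        node = nodes[i]
        for label in sorted(node_labels[node]):
            if compatible(assignment, node, label):
                assignment[node] = label
                yield from extend(i + 1, assignment)
                del assignment[node]

    return list(extend(0, {}))
```

Three mutually connected regions with two labels each, where every
arc only permits unequal labels, show the second cause of multiple
labelling: every label has pairwise support, so filtering deletes
nothing, yet the search finds no consistent labelling at all.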

It is worth remarking that if only unary and
binary constraint filtering are used, or
existential constraints are used but the network is
sufficiently complete, extra labelling (as discussed
above) is the only sort of error that can occur.
Under these circumstances constraint filtering will
be safe in that it will never reject a correct
labelling, even though it may retain some incorrect
labellings. This means that if every region bears
a single label, then we can be sure that all these
labellings comprise the unique, correct solution
to the constraint satisfaction problem. If any
regions bear multiple labels, then we know that
unique interpretations could be found, if
necessary, by a later backtracking process.
Unfortunately, if existential constraints are used, then
the required completeness of the network cannot be
guaranteed unless it is a complete graph, which is
seldom feasible in practice.

The other sort of error that can occur is that
a correct labelling is mistakenly rejected. As
implied above, this can only happen when existential
constraints are applied to a network that lacks
some necessary interconnections. Since the
completeness of the interconnections cannot always be
determined ahead of time, it cannot be foreseen
whether such errors will occur. If they do occur,
they are irreparable, since a label once lost
cannot conveniently be reinstated. But once again,
these errors, while theoretically possible, have
not occurred in our examples because our simple
configuration rule mentioned earlier ensures
sufficient interconnection in the network, at least for
the existential constraints used in the example
domain.

Related  to  this  is  a  problem  that  occurs  if  a 
region  loses  all  of  its  labellings,  that  is,  all 
possible  identifications  of  it  can  be  refuted. 

This  means  that  the  region  is  unrecognizable  as 
anything  from  the  assumed  domain.  But  once  any 
node  in  the  network  is  unrecognizable,  its  effect 
will  be  propagated  until  all  nodes  lose  all  their 
labels.  Strictly  speaking,  this  is  perfectly  cor¬ 
rect:  If  a  scene  contains  objects  that  cannot  be 
recognized,  then  the  scene  could  not  possibly  be 
from  our  chosen  domain,  and  therefore  the  whole 
scene  is  essentially  unrecognizable.  The  problem 
is  that  the  constraint  filtering  implicitly  assumes 
that  the  set  of  labels  and  constraints  correctly 
account  for  everything  that  might  possibl y  appear . 

If  this  assumption  is  violated,  then  the  entire 
Image  must  be  rejected,  even  though  the  image  could 
be  successfully  interpreted  if  the  alien  object 
were  not  there.  While  theoretically  justifiable, 
this  behavior  is  undesirable  in  practice.  If  such 
a  vision  system  were  turned  loose  on  the  world,  we 
would  not  want  it  to  effectively  go  blind  every 
time  an  unexpected  object  chanced  into  its  field  of 
view.  One  solution  to  this  problem  is  to  postulate 
a  catch-all  label  for  which  there  are  no  constraints 
whatsoever.  Any  region  in  the  image  can  therefore 
bear  this  label,  even  those  that  are  otherwise  un¬ 
recognizable.  Of  course,  all  recognizable  regions 
will  also  bear  this  label  in  addition,  and  there 
will  be  numerous  additional  label  pairs  attached  to 
the arcs. This will cause no problem with the interpretation,
but it does introduce a certain computational
overhead which may not be negligible.
Another  solution,  which  does  not  suffer  from  these 
problems,  is  this:  When  a  node  loses  all  its  labels, 
it  should  be  marked  as  unrecognizable,  and  then  re¬ 
moved  from  the  network,  with  its  connecting  arcs  as 
well,  so  that  the  undesirable  effects  cannot  spread 
further.  Both  of  these  solutions  have  ramifications 
that  we  shall  not  go  into  here.  Because  of  this, 
and  because  the  problem  only  arises  when  the  model 
embodied  in  the  constraints  is  inadequate,  we  have 
not  made  any  special  provision  for  handling  it  in 
the  prototype  system.  If  any  region  is  found  to  be 
unrecognizable,  we  go  back  and  revise  the  model  to 
account for the misrecognition.
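
The second remedy, removing a node once it has lost all its labels, can be sketched as follows. This is a minimal illustration in Python rather than the LISP of the prototype, and it covers binary constraints only. The function `filter_labels`, the `compatible` predicate (assumed to be symmetric) and the toy compatibility table are assumptions introduced for the example; they are not part of the system described here.

```python
def filter_labels(labels, arcs, compatible):
    """Discrete relaxation over binary constraints with isolation of dead nodes.

    labels:     dict mapping node -> iterable of candidate labels
    arcs:       iterable of (node_a, node_b) pairs
    compatible: symmetric predicate on two labels (an assumption of this sketch)
    Returns (surviving label sets, set of unrecognizable nodes).
    """
    labels = {node: set(cands) for node, cands in labels.items()}
    arcs = list(arcs)
    # Nodes that arrive with no labels are unrecognizable from the start.
    dead = {node for node, cands in labels.items() if not cands}
    changed = True
    while changed:
        changed = False
        for a, b in arcs:
            for x, y in ((a, b), (b, a)):
                if x in dead or y in dead:
                    continue  # arcs of a removed node no longer propagate
                keep = {lx for lx in labels[x]
                        if any(compatible(lx, ly) for ly in labels[y])}
                if keep != labels[x]:
                    labels[x] = keep
                    changed = True
                    if not keep:
                        dead.add(x)  # mark and cut off instead of spreading
    return labels, dead


# Toy compatibility table (hypothetical): unordered label pairs that may coexist.
ALLOWED = {frozenset(("TANK", "GROUND")), frozenset(("GROUND",))}

def compatible(p, q):
    return frozenset((p, q)) in ALLOWED

# "odd" has already lost every label, for example to the unary constraints.
regions = {"odd": set(), "g": {"GROUND"}, "t": {"TANK", "SKY"}}
result, unrecognizable = filter_labels(
    regions, [("odd", "g"), ("g", "t")], compatible)
```

Without the `dead` check, the empty label set of `odd` would strip `g` of GROUND, and `t` would then lose its labels in turn, which is the spreading effect described above.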

This brings us to one final matter: How are
the constraint models for a particular domain constructed
in the first place? Winston [22] has proposed
an automatic system for building scene models
that uses inductive inference over a set of training
examples. Such an approach is certainly possible
here, but the problem of separating relevant from
irrelevant features can be expected to be very difficult
for all but the simplest scenes. For now,
we  expect  that  models  will  be  built  by  hand.  A  user 
of  the  system  relies  on  his  own  introspection  and 
knowledge  of  the  domain  to  construct  an  initial 
model,  applies  it  to  some  well-chosen  examples, 
diagnoses  any  errors,  and  then  corrects  the  model 
accordingly.  For  many  applications,  this  is  a  quite 
acceptable  way  of  building  models. 

In  order  to  illustrate  the  matters  presented 
above, we now give examples of the operation of our
prototype  system  in  a  particular  domain. 

4.  An example domain: TANKSWORLD

In order to test out the ideas described in the
previous sections, we have implemented a prototype
system for scene analysis using constraint filtering,
and applied it to a domain of forward-looking
infra-red (FLIR) images of tanks and other military
vehicles on fairly open ground. The image segmentation
and region extraction programs were written
in the programming language C. The constraint filtering
system was written in LISP. This includes
the constraint filtering procedures themselves, and
also a number of auxiliary procedures, including
those that provide the relational primitives out of
which the constraints were constructed. The constraints
were written as logical expressions in
these primitives, using special conventions to mark
the variables for the regions.

Example images from this chosen domain (dubbed
"TANKSWORLD")  can  be  seen  in  Figure  3.  We  admit 
five  principal  region  labels  in  TANKSWORLD: 


GROUND  (corresponding to the ground or any patch of ground)
SKY     (the sky or any patch of sky)
SMOKE   (a puff of smoke or similar bright compact object)
TANK    (a tank, or any vehicle)
TREE    (a tree or shrub)

Only TREE and TANK have any real size restriction,
so only for these is oversegmentation a problem.
Therefore we provide two additional labels, TANK-FRAGMENT
and TREE-FRAGMENT, to cover pieces of
these objects.

In general, the spatial position of an object
cannot be determined from a single image. All that
can be said is that the object must lie along a
certain line of sight. However, we know that TANKS
and TREES (we use the labels here informally to
stand for the classes of objects they represent)
must stand on the ground, and if we assume that the
ground is an approximately level plane, we can use
projective geometry and a simple camera model in
order to fix their actual spatial locations, and
from this determine their ranges, and their actual
sizes from their apparent sizes in the image. The
simple camera model used for these computations is
shown in Figure 4. It assumes that the image is
formed by a simple pin-hole camera, with known
parameters. For objects in space we use Cartesian
coordinates x, y, z, with the origin on the ground
vertically below the camera's pinhole. The x axis
runs along the ground to the right from this origin,
the y axis directly forward, and the z axis vertically.
In the image we use coordinates ξ (horizontal)
and η (vertical) relative to an origin at the
center of the field of view. These coordinates are
related by

\[ \frac{\xi}{f} = \frac{x}{y\cos\phi - (z-h)\sin\phi} \]

\[ \frac{\eta}{f} = \frac{y\sin\phi + (z-h)\cos\phi}{y\cos\phi - (z-h)\sin\phi} \]

where h is the height of the pinhole above the
ground, f is the distance from the pinhole to the
film plane, and φ is the dip angle below the horizontal
of the optical axis of the camera. These
equations give an adequate approximation for any
camera, provided the field of view is not too wide.
In practice, these parameters should be known. For
the images used here they were not known, but were
estimated by taking measurements on the images of
objects whose size was approximately known.
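
The use of this camera model to recover range and size can be sketched in Python. The forward projection follows the two equations above. The inverse functions are obtained here by setting z = 0 for a point on the ground and solving for y, so they are a derivation made for this sketch rather than formulas quoted from the system. The function names, the parameter values and the convention that the vertical image coordinate increases upward are all assumptions.

```python
import math

def project(x, y, z, h, f, phi):
    """Image coordinates (xi, eta) of the scene point (x, y, z)."""
    depth = y * math.cos(phi) - (z - h) * math.sin(phi)  # distance along the optical axis
    xi = f * x / depth
    eta = f * (y * math.sin(phi) + (z - h) * math.cos(phi)) / depth
    return xi, eta

def ground_point(xi, eta, h, f, phi):
    """Ground position (x, y) of an image point assumed to lie on the plane z = 0."""
    denom = f * math.sin(phi) - eta * math.cos(phi)
    if denom <= 0:
        raise ValueError("image point lies on or above the horizon")
    y = h * (f * math.cos(phi) + eta * math.sin(phi)) / denom
    depth = y * math.cos(phi) + h * math.sin(phi)
    return xi * depth / f, y

def height_at(eta_top, y, h, f, phi):
    """Height z of a point imaged at eta_top and standing at ground distance y."""
    return h + y * (eta_top * math.cos(phi) - f * math.sin(phi)) / (
        f * math.cos(phi) + eta_top * math.sin(phi))

# Hypothetical parameters: pinhole 3 units up, f = 0.05, dip angle 0.1 rad.
H, F, PHI = 3.0, 0.05, 0.1
base = project(2.0, 50.0, 0.0, H, F, PHI)  # foot of an object 50 units ahead
top = project(2.0, 50.0, 2.5, H, F, PHI)   # its top, 2.5 units above the ground
x0, y0 = ground_point(base[0], base[1], H, F, PHI)
z_top = height_at(top[1], y0, H, F, PHI)
```

The foot of a region fixes its position on the ground, and the image height of its top then gives the actual height. This is the kind of computation a when-proposed function for TANK or TREE would need.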

So for the labels TANK and TREE, we have a
when-proposed function that computes spatial location
on the ground and approximate vertical and
horizontal extent. For the corresponding fragments
the when-proposed function can compute only bounds
on these values, but these bounds are nonetheless
useful.

There are 62 constraints used in the current
model. The unary constraints are used to enforce
the size restrictions on TANKS, TREES and their
fragments, limits on the height to width ratio for
TANKS and TREES, and restrictions on the position
of SKY and GROUND relative to the horizon. The binary
constraints are used in two ways: First, to express
that the region for a compact object such as
TANK, TREE, SMOKE cannot surround the region for
any other sort of object (except that TANKS and
TREES can surround their respective fragments).
Second, to enforce restrictions on the relative
brightness of objects, that SMOKE is brighter than
anything else, TANK is not brighter than anything
else, and that objects of the same class have
roughly the same brightness, except for GROUND which
has considerable variation. (Notice that the system
is given no knowledge of the absolute brightnesses
of objects; only relative brightness is used.
This was done deliberately in order to demonstrate
the ability of the constraint filtering.) Finally,
the existential constraints capture the requirement
that TREES and TANKS must rest upon a piece of
GROUND, and that a fragment of an object must have
next to it another fragment of the same sort such
that the two taken together do not exceed the size
restrictions for the corresponding whole object.
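
As an illustration of how an existential constraint differs from a unary or binary one, the requirement that a TANK must rest upon a piece of GROUND can be sketched in Python. The box representation, the `rests_on` primitive with its tolerance, and the three example regions are hypothetical; the actual relational primitives of the prototype are not reproduced here.

```python
def rests_on(upper, lower, tol=2.0):
    """Hypothetical primitive on boxes (x_min, y_min, x_max, y_max), y upward.

    True when the boxes overlap horizontally and the bottom edge of `upper`
    falls within the vertical extent of `lower`, give or take `tol`.
    """
    overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    bottom = upper[1]
    return overlap > 0 and lower[1] - tol <= bottom <= lower[3] + tol

def has_support(region, boxes, labels, neighbours, support="GROUND"):
    """Existential constraint: at least one connected region can still be
    labelled `support` and lies under `region`."""
    return any(support in labels[n] and rests_on(boxes[region], boxes[n])
               for n in neighbours[region])

boxes = {"r1": (10, 20, 30, 35), "r2": (0, 0, 100, 20), "r3": (40, 60, 50, 70)}
labels = {"r1": {"TANK", "GROUND"}, "r2": {"GROUND"}, "r3": {"TANK", "SKY"}}
neighbours = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2"]}

for region in ("r1", "r3"):
    if "TANK" in labels[region] and not has_support(region, boxes, labels, neighbours):
        labels[region].discard("TANK")  # no supporting GROUND can be found
```

Here `r1` keeps TANK because `r2` lies beneath it, while `r3` floats well above any ground and loses the label. If `neighbours["r1"]` were empty, TANK would be rejected for `r1` as well, although it is correct. That is the failure of incomplete interconnection discussed earlier.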

In Figure 5, we show some typical subimages
from this domain. Figure 6 shows these images after
segmentation with boxes drawn around the regions.
Because of memory limitations of the present implementation,
regions below a certain size were ignored,
and interconnections were made only between regions
whose boxes were immediately adjacent or overlapping,
and to the one or two largest regions in
each image. The constraint-filtering system was run
on these examples, and the results are presented in
Figures 7 and 8. (For each label, unambiguously
labelled regions are shown in white; ambiguously
labelled regions, which bear other labels as well,
are shown in gray.) In all cases the constraint
filtering stabilized after only a few iterations of
the propagation processes.

As  can  be  seen,  the  results  are  quite  good, 
especially  considering  the  noisiness  of  the  original 
image, and the blind simplification done by the
segmentation  and  the  region  extraction.  A  number 
of  problems  with  the  results  are  worth  discussing, 
since  they  illustrate  limitations  of  this  approach. 

Since  there  are  so  few  constraints  on  GROUND, 
many  other  sorts  of  objects  will  retain  this  label. 
In  a  sense,  this  is  perfectly  unobjectionable.  In 
these  images  there  is  no  way  of  distinguishing  a 
tank  from  a  patch  of  ground  with  the  same  shape 
and  coloration  as  a  tank.  In  this  domain  the  TANK 
interpretation  is  more  likely,  but  the  constraint 
filtering  has  no  mechanism  for  expressing  prefer¬ 
ence  between  two  logically  irrefutable  labellings. 

Another problem is that because there is often
little contrast between sky and ground, quite a
number of regions straddle the horizon, and thus
admit both the labels SKY and GROUND. It is clear
that the segmentation is wrong, but the current
system can only accept uncritically the regions it
receives from the segmentation. A more sophisticated
system could attempt to modify the segmentation
when such a contradiction was detected. In a
few  cases,  an  object  clear  to  the  eye  is  merged 
with  another  object  because  of  a  short  segment  of 
low  contrast  boundary  between  them  and  is  thus  lost 
altogether.  Detecting  and  repairing  such  a  mistake 
in  the  segmentation  is  really  quite  difficult. 

The  limited  context  provided  by  the  limited 
interconnection  of  regions  causes  some  difficulties. 
There are a few regions that retain the label SMOKE,
not because they are the brightest regions in the
image, but merely because they are brighter than
anything they are connected to. In some other
images it happens that a cluster of TREE-FRAGMENTS
support each other, even though altogether they are
too large or too small to comprise an entire TREE.
The  system  takes  into  account  only  the  pairwise 
interactions  of  the  fragments,  without  trying  to 
organize  them  into  a  coherent  whole. 

There are some other misidentifications that
can be blamed on the simplified shape description
used here. A number of odd-shaped regions are
labelled as TREES just because the regions happen
to fit boxes of about the right size and shape,
even  though  it  is  apparent  that  they  look  nothing 
like  TREES  in  their  actual  shape. 

Despite these problems, it is clear that the
constraint filtering can accomplish almost all the
task of analyzing these scenes. In the next section,
we will discuss some of the issues raised by
these problems, and consider ways of extending the
constraint filtering process to overcome them.

5.  Discussion 

We  have  seen  in  the  previous  section  that  con¬ 
straint  filtering  is  a  feasible  technique  for  scene 
analysis,  even  if  implemented  in  a  very  simple  way. 
We  will  discuss  here  some  extensions  to  this  tech¬ 
nique  that  would  overcome  most  of  the  shortcomings 
of  the  current  approach,  and  lead  to  a  more  powerful 
and  flexible  scene  analysis  system. 

One straightforward improvement would be to
provide a more accurate shape description for regions,
which would permit more realistic computation
of region properties and relations. The representation
of regions by boxes is convenient, but hardly
satisfactory. In a few cases, a spurious adjacent
or surround relation will hold between the boxes of
two regions, when in fact it is not true of the
regions themselves. This can lead to errors in
interpretation.

It  would  be  desirable  to  provide  some  facility 
for indicating preference between several labels for
a  region,  all  logically  equally  possible,  but  one 
far  more  likely.  The  unavoidable  labelling  of 
TANKS  also  as  GROUND,  mentioned  in  the  previous 
section,  illustrates  this  problem.  More  generally, 
as  suggested  by  numerous  authors  [14,21,23],  it 
would  be  useful  to  attach  probabilities  or  confi¬ 
dence  measures  to  all  the  hypotheses,  properties 
and  relations  in  the  system,  and  provide  a  calculus 
for  combining  these  confidence  measures.  As  an 
example,  the  relation  same-brightness  just  checks 
that  the  difference  in  brightness  between  two 
regions  is  below  some  given  threshold.  In  most 
domains  this  is  unsatisfactory.  There  is  no  sharp 
cut-off  between  "same"  and  "not  the  same".  All  we 
can  say  is  that  the  greater  the  difference  in 
brightnesses  between  two  regions,  the  less  reason¬ 
able  it  is  to  regard  them  as  having  the  same 
brightness. 
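
The contrast between a thresholded relation and a graded one can be sketched in Python. The hard version follows the description of same-brightness given above. The graded version uses an exponential decay, which is only one possible choice made for this illustration; neither the functional form nor the `scale` parameter comes from the system or from the cited proposals.

```python
import math

def same_brightness(b1, b2, threshold):
    """Hard relation as described in the text: difference below a threshold."""
    return abs(b1 - b2) < threshold

def same_brightness_confidence(b1, b2, scale):
    """Graded alternative (assumed form): 1.0 for equal brightness,
    decreasing smoothly as the difference grows."""
    return math.exp(-abs(b1 - b2) / scale)

# Compare a region of brightness 100 with three others.
hard = [same_brightness(100, b, 10) for b in (100, 109, 111)]
soft = [same_brightness_confidence(100, b, 10) for b in (100, 109, 111)]
```

The hard relation flips between the second and third comparison although the two differences are nearly equal, whereas the graded measure changes only slightly.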

Related  to  this  is  the  need  for  a  more  subtle 
combination  of  evidence.  A  certain  label  may  have 
a  number  of  constraints  applicable  to  it.  If  a 
certain  region  passes  all  but  one  of  these  con¬ 
straints  it  would  lose  that  label.  But  in  some 
circumstances  it  may  be  more  reasonable  to  suspect 
that  the  labelling  is  correct  but  that  some  error 
has  been  made  in  the  evaluation  of  the  failed  con¬ 
straint.  Perhaps  an  important  piece  of  evidence 
was  obliterated  by  noise,  occlusion,  or  poor  seg¬ 
mentation.  Ideally  a  scene  analysis  system  should 
be  able  to  tolerate  such  lost  evidence,  and  even 
attempt  to  recover  it  by  a  closer  re-examination 
of the original image.

This  brings  us  to  the  matter  of  the  inter¬ 
action  between  the  scene  analysis  system  and  the 
image  data.  In  the  current  system  there  is  a 
strictly  one-way  flow:  segmentation,  then  analysis 
of  the  segmentation.  It  would  be  preferable  to 
have  a  mechanism  whereby  the  higher-level  analysis 
could,  under  certain  conditions,  call  for  a  re¬ 
examination  of  parts  of  the  original  image  in 
order to search for features that may have been
lost in the initial processing. The scene analysis
system  should  also  be  provided  with  an  arsenal  of 
assorted  image  processing  routines,  in  addition  to 
segmentation,  in  order  to  capture  lines,  spots, 
and  other  features  that  are  likely  to  be  lost  dur¬ 
ing  segmentation. 

The current scheme for interconnecting regions
into  a  network,  while  effective,  is  fairly  ad  hoc. 
It  should  be  possible,  by  an  analysis  of  the  con¬ 
straints,  to  make  a  more  rational  decision  about 
the  connection  of  regions.  A  region  may  need  only 
to  be  connected  to  regions  that  stand  in  certain 
relations  to  it  and  that  retain  certain  labels 
after  the  unary  constraint  filtering,  for  these  are 
the  only  regions  that  could  possibly  falsify  the 
applicable  constraints.  More  generally,  it  may  be 
advantageous  to  permit  the  reconfiguration  of  the 
network  during  processing,  although  this  would  have 
to  be  done  with  great  care  in  order  to  retain  the 
desirable  properties  of  constraint  filtering. 

One  deficiency  of  the  current  system  is  its 
clumsy  notation  for  expressing  constraints  that 
apply  to  a  number  of  labels.  As  mentioned  earlier, 
to  express  the  notion  that  SMOKE  is  the  brightest 
object  in  the  domain  we  must  provide  a  separate 
brighter-than  constraint  for  every  other  label  in 
the domain. This could be overcome, at some cost
in efficiency, by permitting some sort of quantification
over labels, allowing constraints like
"For all labels ℓ, not equal to SMOKE, SMOKE is
brighter than ℓ." But this is only a cosmetic
change, which does not address the underlying
problem. The current system regards all the labels
as being quite independent of each other. This
becomes  quite  a  serious  computational  inefficiency 
as  the  number  of  labels  becomes  large  (as  it  will 
for  any  realistic  domain),  especially  as  the  amount 
of  calculation  on  each  arc  is  roughly  proportional 
to  the  square  of  the  number  of  labels  in  the  domain. 
But  in  reality,  the  labels  in  a  particular  domain 
will usually show certain similarities among themselves
and share many constraints. It is wasteful
to independently re-evaluate for each label these
shared  constraints.  This  inefficiency  can  be 
naturally  and  effectively  overcome  by  organizing 
the  labels  of  a  domain  into  a  taxonomy  based  on 
similarity  and  shared  constraints.  The  system 
could  initially  propose  generic  labels,  which 
stand  for  whole  classes  of  objects,  and  test  these 
labels by applying only those constraints common
to whole classes of objects. Later, when no further
progress could be made by such general reasoning,
the generic labels could be replaced by
more specialized labels, and more specialized constraints
could be applied. For example in TANKSWORLD,
we could group all objects that must lie
below the horizon into a single class, and eliminate
this class label from all regions that lie
above  the  horizon.  Once  this  had  been  done  we 
could  specialize  objects  below  the  horizon  into 
classes  of  compact  and  extended  objects.  Later, 
the  compact  objects  could  be  subdivided  into  TANKS 
and TREES. If necessary, TANKS and TREES could be
further classified into their different models and
varieties. While this scheme is intuitively clear,
some work still remains to be done in order to properly
formalize it, especially in regard to the
interactions between nodes at different levels of
specialization.
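
The taxonomy idea can be sketched in Python for the unary case. The class names `ANY`, `BELOW-HORIZON`, `COMPACT` and `EXTENDED`, the two region attributes and the tests attached to each class are hypothetical, and SMOKE and the fragment labels are left out for brevity. The sketch does not address interactions between nodes at different levels of specialization, which the text notes are still to be formalized.

```python
# Generic labels map to their specializations; leaves are ordinary labels.
TAXONOMY = {
    "ANY": ["BELOW-HORIZON", "SKY"],
    "BELOW-HORIZON": ["COMPACT", "EXTENDED"],
    "COMPACT": ["TANK", "TREE"],
    "EXTENDED": ["GROUND"],
}

# One shared test per class, evaluated once instead of once per leaf label.
TESTS = {
    "BELOW-HORIZON": lambda r: not r["above_horizon"],
    "SKY": lambda r: r["above_horizon"],
    "COMPACT": lambda r: r["compact"],
    "EXTENDED": lambda r: not r["compact"],
}

def candidate_labels(region, taxonomy=TAXONOMY, tests=TESTS, root="ANY"):
    """Descend from generic to specialized labels, pruning refuted classes."""
    frontier, leaves = [root], set()
    while frontier:
        label = frontier.pop()
        test = tests.get(label)
        if test is not None and not test(region):
            continue  # refuting a class refutes all of its specializations
        children = taxonomy.get(label)
        if children:
            frontier.extend(children)
        else:
            leaves.add(label)
    return leaves

low_compact = candidate_labels({"above_horizon": False, "compact": True})
low_extended = candidate_labels({"above_horizon": False, "compact": False})
high = candidate_labels({"above_horizon": True, "compact": False})
```

A region above the horizon is never tested against the constraints for TANK, TREE or GROUND at all, which is where the saving over independent labels comes from.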

The  current  system  also  suffers  from  a  one- 
level  treatment  of  network  nodes.  It  is  possible 
for a cluster of regions to retain the label TREE-FRAGMENT,
even though the cluster, considered as a
unit, looks nothing like a TREE. What is needed is
a mechanism for creating new nodes having existing
nodes as parts. This becomes more acutely necessary
in more complex domains whose objects may be
built  up  from  distinct  parts.  Even  in  TANKSWORLD, 
at  slightly  better  resolution,  a  TREE  would  be  seen 
to  consist  of  a  trunk,  branches  and  foliage;  and  a 
TANK  would  show  wheels,  turret,  gun-barrel  and 
other  details.  Such  techniques  for  hierarchical 
constraint  filtering  have  been  studied  in  simpler 
domains  [24,25],  but  require  further  development 
for  more  complex  domains,  especially  if  they  are 
to  be  applied  in  an  efficient  manner. 

Recently,  Davis  [26]  has  shown  that  constraint 
filtering,  expressed  formally  in  logic,  can  be  re¬ 
garded  as  a  limited  form  of  inferencing.  This 
raises  the  possibility  that  more  powerful  forms  of 
constraint  filtering  could  be  devised.  Currently, 
these techniques work by falsifying simple hypotheses
about the individual and joint identifications
of nodes in the network. More powerful techniques
could conceivably reason about other properties
and relations between nodes, for example,
occlusion  relations  between  objects.  Formal  logic 
and  theorem-proving  would  also  provide  a  convenient 
means  of  treating  some  of  the  other  extensions  of 
constraint  filtering  described  above. 

In  conclusion,  we  have  shown  that  constraint 
filtering  is  an  effective  means  of  scene  analysis 
in  a  domain  more  complex  than  has  previously  been 
used  with  such  techniques.  The  deficiencies  of  the 
approach, as revealed by the results we have obtained,
have suggested a number of improvements
and extensions to constraint filtering. The development
of these extensions is the object of current
research.

1.  INTRODUCTION


This  paper  presents  a  discussion  of  two 
applications of the results of a structural texture
analysis  system  [8].  The  details  of  the  basic 
analysis  procedure  will  not  be  presented  in  this 
report.  The  procedures  to  generate  basic 
descriptions  of  natural  textures  are  described  in 
[1-2].  The  extraction  and  description  of  texture 
elements  is  described  in  [3].  The  final 
determination  of  relations  (placement  rules)  and  the 
reconstruction  of  texture  patterns  is  presented  in 
[4-5]. Short descriptions of these programs can
also be found in [6-7].


The goal of the symbolic analysis program was
to produce a description of a texture pattern which
is similar to the type of descriptions produced by
human observers. The description is based on the
appearance and placement of elementary texture
primitives. These descriptions can be used for
reconstruction of the pattern, for recognition of
the input pattern, and for analysis of texture
gradients for determination of surface orientations.
The last two topics are described in this report.

2.  TEXTURE RECOGNITION


A  texture  recognition  algorithm  that  uses  the 
descriptions  generated  by  our  texture  analysis 
programs  is  discussed  below.  The  descriptions 
consist  of  the  periodicity  of  the  texture  and  the 
size and shape of the texture elements. Details of
descriptions are given in part in [1-3,6] and all
details may be found in [8]. Hopefully, the
discussion  of  the  recognition  algorithm  below  is 
self-explanatory  in  terms  of  the  descriptions  used. 
The  recognition  scheme  is  basically  a  decision  tree. 
Eleven  types  of  textures  were  used  in  our 
experiments. 

Texture  Recognition  Algorithm 

The structure of the algorithm used for texture
classification is shown in Fig. 1. The textures
classified are: floor grating (dark dot pattern),
brick wall, aerial view of city, raffia (woven
palm), herringbone material, wood grain, aerial view
of water, straw, grass, sand, and wool.
Most of the textures used are from the Brodatz
album [12]. The exceptions are the floor grating,
brick wall, and aerial city patterns. Aerial city
pictures taken at different orientations and
different scales were used. Only 11 such samples
were available. All other texture groups consist of 16
samples each. Pictures of both shifted and
unshifted brick patterns are used.

The  decision  tree  form  was  chosen  for  this 
texture  classification  scheme.  The  structure  of  the 
tree  allows  the  classification  information  to  be 
weighted  by  means  of  relative  ordering.  For 
example,  the  test  for  periodicity  is  encountered  at 
an  earlier  stage  than  any  of  the  aspect  ratio  tests. 
Clearly  texture  period  information  is  being  given 
more  significance  than  aspect  ratio  information 
according  to  this  classification  scheme. 

The  first  texture  characteristic  considered  is 
periodicity.  If  a  texture  is  non-periodic  it  will 
be  classified  by  the  subtree  to  the  right  of  the 
root  node.  However,  if  the  texture  exhibits  signs 
of  periodicity  but  fails  the  tests  for  each  of  the  5 
regular  texture  patterns  it  is  sent  back  to  the  root 
node,  re-labeled  as  a  non-periodic  texture.  Due  to 
this  loop  in  the  structure,  the  classification 
algorithm  does  not  have  the  exact  form  of  a  binary 
decision  tree.  However,  for  convenience  tree 
terminology  will  be  used  in  the  discussion.  The 
loop  is  introduced  to  accommodate  textures  which  are 
basically  non-periodic,  but  may  show  evidence  of 
periodicity  some  of  the  time.  The  wood  grain,  water 
and  straw  textures  exhibit  this  characteristic.  The 
opposite is not as likely to occur, i.e., the
periodic  textures  are  not  mistakenly  sent  down  the 
non-periodic  branch. 
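
The control flow described above, including the loop back to the root node, can be sketched in Python. The individual tests and the class names in the example are placeholders. The actual decision boxes and their arrangement are those of Fig. 1, which this sketch does not attempt to reproduce.

```python
def classify_texture(description, periodic_tests, classify_nonperiodic):
    """Try the periodic branch first; fall back to the non-periodic subtree.

    description:          dict of texture measurements with a "periodic" flag
    periodic_tests:       ordered list of (class_name, predicate) pairs
    classify_nonperiodic: function implementing the non-periodic subtree
    """
    if description["periodic"]:
        for class_name, test in periodic_tests:
            if test(description):
                return class_name
        # Signs of periodicity, but no regular pattern matched:
        # re-label as non-periodic and restart from the root.
        description = dict(description, periodic=False)
    return classify_nonperiodic(description)

# Placeholder branch contents, for illustration only.
periodic_tests = [("periodic-class-A", lambda d: d.get("passes_a", False))]
classify_nonperiodic = lambda d: "nonperiodic-class"

matched = classify_texture({"periodic": True, "passes_a": True},
                           periodic_tests, classify_nonperiodic)
fell_back = classify_texture({"periodic": True, "passes_a": False},
                             periodic_tests, classify_nonperiodic)
direct = classify_texture({"periodic": False},
                          periodic_tests, classify_nonperiodic)
```

The fallback runs in one direction only: a texture that enters the non-periodic subtree is never sent back to the periodic tests.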

No  absolute  texture  element  dimensions  or 
intensity  values  are  used;  and  all  directional 
information  is  relative  as  well.  In  this  way,  the 
analysis  should  be  insensitive  to  scaling,  rotation 
and degree of contrast within the image. Measures
which are used are primitive eccentricity, dimension
to period ratio, the number and significance of
texture element types, and relative intensities and
orientations of texture primitive types. The
details of each decision box for the two main
decision branches are given below:

Periodic  Branch 

1)  Dark Round Primitives - More than four dark
primitives are merged into one primitive type.

2)  Multiple Sized Primitives - Multiple sized
primitives are found in two perpendicular
directions. The texture period is equal to the sum
of the element sizes. The sum of the (size/period)
ratios for the four most significant primitives is
less than .2.

3)  Most Significant Primitives are Periodic in the
Same Direction - The two most significant primitives
exhibit an element spacing value for the same scan
direction.

4)  Large (Dark Element Size/Light Element Size)
Ratio - The dark primitive is at least twice as wide
as the light primitive in the most significant scan
direction.

5)  Light  Elongated  Primitives  -  The  ratio  of  the 
dimension  in  the  most  significant  scan  direction  to 
the  dimension  in  the  direction  perpendicular  to  the 
direction  of  scan  is  less  than  .4. 

Non-Periodic Branch

1)  Uni-Directional Texture - The two most
significant primitives are found in the same scan
direction. The significance numbers associated with
these two primitive types are greater than all other
significance numbers by at least .66 (out of a
possible 1.0).

2)  Less Elongated Primitives - The sum of the
aspect ratios for the two most significant
primitives is at least as great as .175.

3)  Most Significant Primitive is Relatively Dark.

4)  Elongated Texture Primitives - The minimum
aspect ratio of the four most significant texture
primitives is no larger than .18.

5)  Two Primitive Types - Primitives are found for
the two most significant (intensity, direction)
pairs.

6)  Gradual Loss of Significance - There is no
abrupt loss of significance after the fourth
(intensity, direction) pair. The difference is
smaller than .3.

7)  Primitives Found in Two Perpendicular
Directions.

8)  Low Size/Spacing Ratio - There is a low
(element size/element spacing) ratio for relatively
dark primitives in two perpendicular directions.
The sum of both ratios is less than .53.

9)  Less Gradual Loss of Significance - The loss of
significance after the fourth (intensity, direction)
pair is at least .3.
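
Several of the decision boxes listed above reduce to simple threshold predicates, sketched here in Python. The thresholds are those stated in the list. The function names, the argument names and the assumption that `aspect_ratios` is ordered by decreasing primitive significance are introduced only for this illustration.

```python
def large_dark_to_light_ratio(dark_width, light_width):
    """Periodic test 4: dark primitive at least twice as wide as the light one."""
    return dark_width >= 2.0 * light_width

def light_elongated(dim_along_scan, dim_across_scan):
    """Periodic test 5: along-scan to across-scan dimension ratio below .4."""
    return dim_along_scan / dim_across_scan < 0.4

def less_elongated(aspect_ratios):
    """Non-periodic test 2: aspect ratios of the two most significant
    primitives sum to at least .175."""
    return sum(aspect_ratios[:2]) >= 0.175

def elongated(aspect_ratios):
    """Non-periodic test 4: smallest aspect ratio among the four most
    significant primitives is no larger than .18."""
    return min(aspect_ratios[:4]) <= 0.18

def gradual_loss(significance_drop):
    """Non-periodic test 6 when True, test 9 when False: the drop in
    significance after the fourth (intensity, direction) pair is below .3."""
    return significance_drop < 0.3
```

Tests 6 and 9 are complements of each other, so a single predicate serves both branches.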

No  attempt  has  been  made  to  optimize  this 
decision  scheme.  The  classification  results  are 
discussed  in  the  following  section. 


Classification  Results 

Classification  results  are  given  in  the 
confusion  matrix  shown  in  Table  1.  The  types  of 
samples  to  be  classified  are  listed  to  the  left  of 
the  matrix.  Each  row  shows  how  a  specific  set  of 
samples  was  classified.  For  example,  15  aerial 
water  texture  samples  were  correctly  classified, 
while  one  sample  was  incorrectly  classified  as  wood 
grain.  One  hundred  seventy-one  samples  were 
classified  in  all.  In  most  cases  the  samples  came 
from  512  x  512  pixel  texture  images.  These  were 
divided  into  sixteen  128  x  128  pixel  non-overlapping 
texture  subwindows.  However,  in  the  case  of  the 
aerial  city  samples  only  11  samples  were  available. 
These  were  cropped  from  two  different  satellite 
images  of  the  San  Francisco  area.  The  16  brick  wall 
samples  were  taken  from  three  separate  brick  wall 
images.  One  hundred  fifty-six  samples  were 
correctly  classified  to  give  an  overall  success  rate 
of  91.23%.  It  should  be  noted  that  additional 
contextual  information,  e.g.,  color,  scale,  and  type 
of  scene  would  probably  have  improved  the  results 
obtained.  However,  no  information  of  this  type  was 
used  in  the  classification  scheme  in  order  that  the 
strength  of  the  texture  descriptions  used  could  be 
tested  in  isolation. 
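
The sample counts and the success rate quoted above can be checked with a few lines of Python. The short class names are abbreviations chosen for this example. The per-class counts are the ones stated in the text: 11 aerial city samples and 16 for every other texture type.

```python
texture_types = [
    "floor grating", "brick wall", "aerial city", "raffia", "herringbone",
    "wood grain", "aerial water", "straw", "grass", "sand", "wool",
]
samples = {name: 16 for name in texture_types}
samples["aerial city"] = 11            # only 11 aerial city samples were available

total = sum(samples.values())          # 10 * 16 + 11
correct = 156
success_rate = 100.0 * correct / total # percentage correctly classified
errors = total - correct
```

With eleven texture types this gives 171 samples, of which 15 are misclassified, and the rate rounds to the 91.23% reported.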

There  are  no  mismatches  for  the  highly 
structured,  regular  texture  group.  However,  there 
are  a  number  of  non-periodic  texture  samples  which 
are classified incorrectly. One source of confusion
is  between  the  water  and  wood  grain  textures.  Both 
are  one-dimensional  textures  made  up  of  elongated 
texture  primitives.  The  wood  grain  primitives  tend 
to  be  more  elongated  than  the  water  wave  primitives. 
There  is  little  else  which  is  noticeably  different 
from  a  structural  point  of  view.  Hence,  confusion 
of  these  two  texture  types  is  predictable. 
Additional  contextual  information,  e.g.,  the  scale 
information  for  both  textures,  would  improve  the 
classification  results. 

Another area for confusion is the set of
textures consisting of grass, sand, and wool, which
exhibit the least amount of structure of the entire
set. The edge images, ERAs, ERA descriptions,
composite texture primitive masks, and texture
primitive descriptions are very similar for all
three types. This is because the program is not
designed to measure the types of features which most
readily differentiate these textures. In light of
these  description  similarities,  confusion  among 
members  of  this  group  is  to  be  expected.  It  should 
be  noted  that  none  of  these  texture  samples  were 
confused  with  textures  from  outside  of  the  group  and 
none  of  the  other  texture  samples  were  mistakenly 
classified  as  one  of  these.  The  same  can  be  said  of 
the  subgroup  formed  by  the  water  and  wood  grain 
textures. 

In summary, 156 samples out of 171 were
correctly classified to give an overall success rate
of 91.23% for this classification scheme. These
results are extremely encouraging. It would seem
that the information extracted by the algorithms
presented earlier in this thesis describes meaningful
texture characteristics.

3. SURFACE ORIENTATION ANALYSIS

In  Figs.  2  and  3  two  texture  gradient  images 
are  shown.  Figure  2  is  an  image  of  a  brick  wall  and 
Fig. 3 is an image of a redwood shake roof. In each
case the textured surface is at a non-zero angle
with  respect  to  the  image  plane.  The  brick  wall 
recedes  to  the  left  of  the  image,  while  the  shake 
roof  slants  away  from  the  viewer  toward  the  top  of 
the  image. 

One  cue  which  we  use  to  infer  information  about 
the  orientation  of  textured  surfaces  is  the  texture 
gradient,  i.e.,  the  relative  change  in  size  or 
period  of  the  elements  making  up  the  textured  region 
within  the  image.  Assuming  that  the  textured 
surface  being  viewed  is  homogeneous,  the  element 
sizes  of  like  texture  primitives  should  decrease 
with increased distance from the viewer. The
direction of maximum rate of change of depth with
respect to the observer should be discernible as
well  as  the  degree  of  surface  slant  in  this 
direction.  A  convenient  way  to  represent  these  two 
surface  characteristics  is  via  the  gradient  space. 

Let $z = f(x,y)$ be a function defining a planar
surface in 3-Space, where the image plane is
parallel to the x-y plane (see Fig. 4). The
surface normal, N, can be defined as follows:
$N = (f_x, f_y, -1)$. Letting $p = f_x$ and $q = f_y$, we have
the gradient vector, $G = (p,q)$. The direction of G
is $\tan^{-1}(q/p)$, while the magnitude of G is
$\sqrt{p^2 + q^2}$. The direction of G denotes the
direction of the greatest rate of change within the
image, while the magnitude of G determines its
quantity. Also, the tangent of the angle that the
surface  makes  with  the  x-y  (or  image)  plane  is  equal 
to  the  magnitude  of  G.  Therefore,  the  orientation 
of  a  surface  in  3-Space  can  be  represented  as  a 
point  in  the  gradient  space. 

In [10] Stevens suggests an alternate set of
coordinates to represent surface orientation. He
suggests the pair $(\sigma, \tau)$, where

$$\sigma = \tan^{-1}\left((p^2 + q^2)^{1/2}\right),$$

and

$$\tau = \tan^{-1}(q/p).$$

Stevens' slant and tilt angle terminology will be
used. The tilt angle, $\tau$, is an angle made with the
horizontal axis of the image plane. It is the
direction of the projection of the surface normal
onto the image plane. Stevens has shown that this
direction coincides with the direction which exhibits
the greatest rate of change of distance from the
observer to the surface. This is precisely the
direction of the texture gradient. Therefore, by
calculating the texture gradient one can determine
the tilt angle. The slant angle, $\sigma$, is the angle
which defines how much the image plane orientation
differs from the surface plane orientation. For a
discussion of various forms of surface orientation
representation see the works of Stevens [10] and
Render  [13]. 
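
The conversion from gradient space to slant and tilt described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `slant_tilt` is hypothetical, angles are returned in degrees, and `atan2` is used in place of a plain arctangent of q/p so that the quadrant of the tilt is preserved and p = 0 does not cause a division by zero.

```python
import math


def slant_tilt(p, q):
    """Return (slant, tilt) in degrees for a surface gradient (p, q).

    slant = arctan(sqrt(p^2 + q^2)), the angle between the surface and
    the image plane; tilt = direction of the gradient vector (p, q).
    """
    slant = math.degrees(math.atan(math.hypot(p, q)))
    # atan2 is an assumption of this sketch; the text writes tan^-1(q/p).
    tilt = math.degrees(math.atan2(q, p))
    return slant, tilt


if __name__ == "__main__":
    # A plane z = x rises at 45 degrees along the horizontal image axis.
    print(slant_tilt(1.0, 0.0))
```

A surface parallel to the image plane has p = q = 0 and therefore zero slant; its tilt is undefined, and this sketch simply reports 0 in that case.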


Calculating  the  tilt  angle  is  fairly 
straightforward.  It  entails  determining  the 
direction  of  the  gradient  of  any  texture  measure 
which  is  scaled  (due  to  distance) ,  foreshortened 
(due  to  surface  orientation) ,  or  both  scaled  and 
foreshortened. 

Determining the slant angle, $\sigma$, is more
involved. One method suggested is to determine
which texture measure corresponds to the
characteristic dimension, that is, which texture
measure is scaled but not foreshortened. The
normalized gradient of this dimension, taken in the
direction of the texture gradient, is equal to the
tangent of the slant angle,

$$\frac{\nabla d}{d} = \tan \sigma, \qquad (1)$$


where  d  is  the  characteristic  dimension. 
Characteristic  dimensions  are  parallel  to  the  image 
plane  and  are  perpendicular  to  the  local  surface 
tilt.  Therefore,  after  the  surface  tilt  has  been 
determined  the  orientation  of  the  characteristic 
dimension is known. This scheme for calculating $\sigma$
cannot be used if the image is an orthographic
projection,  or  if  the  elements  exhibit  successive 
occlusion.  Alternative  schemes  are  explored  in  [10] 
for  handling  these  problems.  Here  it  will  be 
assumed  that  neither  problem  exists. 

In  [9]  Bajcsy  presents  a  method  for  calculating 
the  angle  formed  by  the  image  and  surface  planes. 
However,  the  dimension  used  in  this  case  is  the 
dimension  oriented  in  the  gradient  direction. 
Therefore,  it  is  both  foreshortened  and  scaled. 


Using the principles of projective geometry,
the trigonometric rules pertaining to similar
triangles and some small angle approximations,
Bajcsy derives an expression for $\sigma$, the angle formed
by the surface and image planes:

$$\frac{-\tan \sigma}{\text{Focal Distance}} = \frac{\text{Fractional Change in Element Size}}{\text{Baseline in Image}} \qquad (2)$$


where  the  fractional  change  in  element  size  and  the 
baseline  within  the  image  are  both  measured  in  the 
direction  of  the  texture  gradient.  Details  of  the 
derivation  can  be  found  in  [8]. 


Both  methods  assume  proximity  to  the  line  of 
sight  defining  the  local  image  plane.  All  of  the 
examples  used  in  the  following  section  have  maximum 
off-center angle less than 10 degrees. Further work
is necessary to define the transformation needed to
correct for large off-center angular separations.
This  transformation  should  take  into  account  the 
image  plane  and  lens  system  characteristics  as  well 
as  the  geometry  needed  for  coordinate 
transformations. 


A General Orientation Analysis Technique

Application of our structural texture analysis
techniques to the surface orientation analysis problem
is in a preliminary stage. The process has not been
fully defined or automated. An outline of a
proposed algorithm is discussed below, and some
preliminary results are presented in the following
section. Although only part of the program is
operational (see 2 below), most of the remaining
program sections are defined and seem feasible from
a programming point of view.

One  effect  which  must  be  anticipated  by  this 
technique  is  the  possible  changing  orientation  of 
the texture primitives. If the texture gradient were
strong enough, the height of the brick primitives in
Fig.  2  would  be  found  in  the  120  degree  scan 
direction  in  the  lower  left  hand  part  of  the  image 
and  in  the  vertical  scan  direction  in  the  right  and 
central  areas.  One  possible  solution  is  discussed 
in  (3)  below. 

A  possible  scenario  for  detecting  surface 
orientation  is  as  follows: 

1. Divide the textured region into locally uniform
subwindows, i.e., subwindows which exhibit very
little, if any, element size variation. This
algorithm is not yet defined. However, as a first
approximation the image can be divided into
subwindows which accommodate the largest texture
elements. Then element sizes can be averaged over
each texture subwindow to produce an element size
for that location in the image. This was done
manually for the two examples discussed below. (In
order to automatically determine the largest element
size, one might calculate modified ERAs (see 2
below) for a large range of distances, say up to one
third of the largest image dimension, over the
entire texture image. Then an ERA interpretation
routine, similar to the one presented in [2], can be
used to determine the largest texture element
dimensions.)

2. Calculate modified ERAs for each subwindow
within the region. Only the first match encountered
will be recorded for a particular directional scan.
Hence, no element size or spacing repetitions should
be counted. It is hoped that this will prevent
repetitions from being interpreted as element size
variations within a given subwindow. The modified
ERA calculation scheme is operational and has been
used to produce the results shown in Fig. 5.

3. Look for a dimension exhibiting strong results
for each subwindow. A dimension, d, which is
locally uniform but which reflects the gradient of
the texture is sought. It should be verified that
all of the ERA results chosen refer to the same
texture primitive type. One way to achieve this end
is to extract texture primitives for overlapping
subwindows and verify that enough of the primitives
extracted belong to both subwindow neighbors. In
this way textural elements which shift orientation
due to the effect of angular perspective can be
found in each subwindow.

4. Create a matrix, M, of the centroid values of
the element size peaks for d. Use a gradient
operator on M to calculate the value of the tilt, or
gradient, angle of the surface within the image.


5. Knowing the tilt angle means that the
orientation of the characteristic dimension is also
known. It can then be ascertained if either the
characteristic dimension or the gradient dimension
is already available as part of the set of ERA
results. If this is the case then the rest of this
step can be skipped. If this is not true then the
characteristic or gradient dimension of some
non-background texture primitive type must be
measured from texture primitive masks. (These
measurements should proceed outward from the
elemental centers of mass.) Either of these two
dimensions can be used for the slant angle
calculation. Histograms of these dimensions should
be kept so that the final calculations can be made
using the highest amplitude/lowest standard
deviation results. When measuring the
characteristic dimension the original set of
subwindows can be used. However, when calculating
the gradient dimension the subwindows might have to
be recropped to provide the correct center to center
angle.

6. At this point either Eq. 1 or 2 can be used to
calculate the surface slant angle to complete the
procedure. If the characteristic dimension is known
then Eq. 1 would be used. If the dimension oriented
in the texture gradient direction is known then
Eq. 2 would be used. Both slant angle calculation
methods are discussed above.
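
Step 4 of the outline above (a gradient operator applied to the element size matrix M) can be sketched as follows. The paper does not say which gradient operator was used, so plain averaged first differences are an assumption here, as are the function names and the convention that columns run left to right as x and that y increases toward the top row of the matrix.

```python
import math


def mean_gradient(M):
    """Average first differences of a 2-D list M of element sizes.

    Returns (sx, sy): the mean change per subwindow step to the right
    and the mean change per subwindow step upward (row 0 is the top).
    M needs at least two rows and two columns.
    """
    rows, cols = len(M), len(M[0])
    dx = [M[r][c + 1] - M[r][c] for r in range(rows) for c in range(cols - 1)]
    dy = [M[r][c] - M[r + 1][c] for r in range(rows - 1) for c in range(cols)]
    return sum(dx) / len(dx), sum(dy) / len(dy)


def tilt_degrees(M):
    """Direction of the element size gradient, in degrees from the x axis."""
    sx, sy = mean_gradient(M)
    return math.degrees(math.atan2(sy, sx))


if __name__ == "__main__":
    # Hypothetical sizes that shrink toward the left, as for a wall receding left.
    wall = [[10, 12, 14], [10, 12, 14], [10, 12, 14]]
    print(tilt_degrees(wall))
```

The gradient computed this way points toward increasing element size, i.e. toward the nearer part of the surface; the receding direction is the opposite one along the same axis.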

Texture  Gradient  Examples 

In  this  section  two  texture  gradient  examples 
are  presented  and  discussed.  They  make  use  of  the 
general method outlined above.

Example 1

Consider the brick wall image in Fig. 2. This
image is 512 x 512 pixels. It was divided into sixteen
128 x 128 pixel subimages, and ERAs were calculated
for each of these. The vertical element size ERAs
for each subwindow are shown in Fig. 5. As expected,
the element sizes decrease toward the left side of
the image. Figure 6(a) shows the brick vertical
element size dimension matrix; in Fig. 6(b), the
results of the gradient calculation are shown; and
Fig. 6(c) shows the results of the tilt and slant
angle calculations. In this case the vertical brick
dimension is the characteristic dimension; hence,
the method developed by Stevens will be used to
calculate the surface slant. The slant and tilt
calculations are carried out for the 4 interior
subwindows of the image (Fig. 2). The tilt angle
measured from the image is approximately 3°. The
tilt angle results range from 0.967° to 4.52°.
Unfortunately, the actual slant angle is not known
for this image. However, the results found for this
angle seem to be reasonable. The angle made by the
image and surface planes, or similarly, the angle
between the principal ray and the surface normal,
seems to be in the neighborhood of 45°. In the next
example the approximate angle formed by the surface
and the image plane is known.
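
As a check on the tilt figures quoted for Example 1, the sketch below recomputes the tilt at each of the four interior locations from the gradient components printed in Fig. 6(b). It assumes that the tilt is taken as the acute angle between the gradient direction and the horizontal image axis, so signs are dropped; the dictionary and function names are illustrative only.

```python
import math

# (S_x, S_y) pairs as printed in Fig. 6(b) for locations (1) to (4).
GRADIENTS = {
    1: (-23.146, -0.83),
    2: (21.778, -1.094),
    3: (-23.566, -1.864),
    4: (-22.577, -0.381),
}


def tilt_from_gradient(sx, sy):
    """Acute angle in degrees between the gradient and the horizontal axis."""
    return math.degrees(math.atan(abs(sy) / abs(sx)))


tilts = {k: tilt_from_gradient(sx, sy) for k, (sx, sy) in GRADIENTS.items()}

if __name__ == "__main__":
    for k in sorted(tilts):
        print(k, round(tilts[k], 3))
```

Under this assumption the smallest and largest values come from locations (4) and (3) and agree with the 0.967° to 4.52° range stated in the text.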


Example  2 

Consider the shake roof image in Fig. 3. This
image is 512 x 512 pixels. It was divided into nine
170 x 170 pixel subimages. (The last two rows and
columns were not used.) ERAs were calculated for
each of the nine subimages. The vertical element
size ERAs for each subwindow are shown in Fig. 7.
As expected, the element sizes decrease toward the
top of the image. Figure 8(a) is the matrix of the
centers of mass for the vertical element size ERAs
of Fig. 7. In Fig. 8(b) the results of the gradient
and tilt angle calculations are shown, and Fig. 8(c)
shows the results of the slant angle calculation for
three image locations. The gradient direction
dimension, i.e., the vertical dimension, of the wood
shake elements was already known via ERA
calculation. Therefore, the surface slant angle
was calculated using the method developed by Bajcsy.
(It would be very difficult to use the
characteristic dimension in this example since the
widths of the wood shakes are variable.) The slant
angle computations are carried out for 3 sets of
data. The tilt angle is calculated once since there
are only enough windows for one gradient
computation. The tilt angle for the image (Fig. 3)
is approximately 90°. The tilt angle was computed
to be 88.49°. The slant angle was calculated to be
approximately 71.97°. (See Fig. 9.) The slant angle
calculation results range from 72.405° to 74.167°.

Determination of window size is a problem which
must be addressed. One possible solution is to
calculate a set of ERAs for the entire image,
initially,  for  a  wide  range  of  distances.  The 
maximum  element  sizes  found  would  then  dictate  the 
appropriate  window  size. 

4.  SUMMARY  AND  CONCLUSIONS 

Structural texture analysis techniques
previously developed were applied to two texture
analysis problems, namely, texture recognition and
surface orientation determination. The results
obtained in both cases appear to be very promising.

First, a classification scheme using both
one-dimensional texture descriptions and texture
primitive descriptions was presented and
classification results were discussed. The
algorithm was designed to classify 11 different
types of textures, including 6 random and 5 periodic
textures. The random textures were taken from the
Brodatz album [12]. These are grass, sand, wool,
water, wood, and straw. The 5 periodic textures
came from a variety of sources ranging from aerial
imagery to pictures taken by the author. They are
raffia, herringbone material, floor grating, aerial
city, and brick wall. The algorithm achieved an
overall success rate of 91.23%. The classification
scheme worked well for the non-periodic texture
group, and extremely well for the highly structured
regular textures, achieving 100% correct
classification for this group. In those cases where
there was confusion, additional contextual
information would have been helpful. For example,
knowing scale information would have aided in
distinguishing wood grain from the aerial water
texture. As might be expected, the amount of
algorithm success varied directly with the amount of
structure present in the texture. Since the range
of textures handled by this algorithm is varied and
any confusion encountered is restricted to small
groups (2-3) of similar textures, it is fair to say
that the description information extracted thus far
corresponds to meaningful textural features.

The  surface  orientation  determination  scheme 
presented  is  in  preliminary  form.  The  method 
described  is  only  partly  automated.  It  uses  schemes 
developed by Bajcsy [9] and Stevens [10]. It also
utilizes  the  techniques  discussed  in  earlier 
reports.  Some  preliminary  results  are  presented  and 
discussed.  Tilt  and  slant  angles  are  calculated  for 
two  images  exhibiting  non-zero  texture  gradients. 
The  results  for  all  known  quantities  are  accurate  to 
within  5  degrees.  These  results  appear  to  be 
promising.  However,  more  work  needs  to  be  done  to 
completely automate the process.

Figure 6. Analysis for Example 1.

(a) Brick Vertical Dimension Matrix. Each value, d,
is a dark, vertical element size.

(b) Gradient Results for (a).

Location (1): S_x = -23.146, S_y = -0.83
Location (2): S_x = 21.778, S_y = -1.094
Location (3): S_x = -23.566, S_y = -1.864
Location (4): S_x = -22.577, S_y = -0.381

(c) Tilt and Slant Angle Matrices.

Figure 7. Vertical Brick Element Size Graphs for
Example 2.

ABSTRACT

Image focus is a simple function of distance to the imaged
point (an effect commonly referred to as depth of field) and the
parameters of the lens. Therefore, by measuring focus we can
determine distance to the imaged object. Two algorithms are
presented. The first determines lens-to-object distance at points
of discontinuity in the scene, using one image acquired with a
standard lens system. The second algorithm determines distance
for all regions of the image that contain high-frequency
information, using a single view acquired with a special lens system. The
distance estimate produced by the second algorithm is
overconstrained; thus, the accuracy of the estimate can be checked
internally. These results show that single-view measurement of focus
provides information comparable to measurement of stereo
disparity or motion, while avoiding image-to-image matching
problems. A simple experiment is described showing that focus
information is important in human perception.


1.  Introduction 

Most lens systems are exactly focused* at only one distance
along each radius from the lens into the scene. The locus of
exactly focused points forms a doubly curved, approximately
spherical surface in three-dimensional space. Only when objects
in the scene intersect this surface is their image exactly in focus;
as the distance between the imaged point and the surface of
exact focus increases, the imaged objects become progressively
more defocused. This change in focus as a function of distance
is known as depth of field.

The amount of defocus or blurring depends solely on the
distance to the surface of exact focus and the characteristics of
the lens system. Therefore, if we could measure the amount
of blurring at a given point in the image, we could use our
knowledge of the lens system to compute the distance to an
imaged point in the scene.

*"Exact focus" is taken here to mean "has the minimum
variance point spread function"; the phrase "measurement of
focus" is taken to mean "characterize the point spread
function."

The distance D to an imaged point is related to the
parameters of the lens system and the amount of defocus by the
following equation, which is developed in the appendix:

$$D = \frac{F v_0}{v_0 - F - \sigma f} \qquad (1)$$


where $v_0$ is the distance between the lens and the image plane
(e.g., the film location in a camera), $f$ the f-number of the lens
system, $F$ the focal length of the lens system, and $\sigma$ the
spatial constant of the imaged point's "blur circle." Image blurring
is described by the point spread function, which is perhaps
approximated best by a two-dimensional Gaussian $G(r,\sigma)$ with a
spatial constant $\sigma$, which may be parameterized by radial
distance $r$ as follows:

$$G(r,\sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{r^2}{2\sigma^2}\right) \qquad (2)$$

The use of a Gaussian to describe the point spread function is
discussed in the appendix.

In most situations, the only unknown on the right-hand
side of Equation (1) is $\sigma$, the point spread function's spatial
parameter. Thus, we can use Equation (1) to solve for absolute
distance given only that we can measure $\sigma$, i.e., the amount of
blur at a particular image point.

I shall present two algorithms that estimate absolute
distance through measurement of $\sigma$. The first algorithm uses a
standard lens system, while the second uses a specially designed
lens system. Both methods require only one view of the scene.
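
Equation (1) is simple enough to evaluate directly once the blur has been measured. The sketch below assumes all lengths share one unit (millimetres in the example) and that the blur constant is expressed in that same unit; the function name, the sample lens values, and the error check are illustrative assumptions, not part of the original method.

```python
def distance_from_blur(focal_length, image_distance, f_number, sigma):
    """Evaluate D = F*v0 / (v0 - F - sigma*f) from Equation (1).

    focal_length (F), image_distance (v0) and sigma must share one unit.
    """
    denominator = image_distance - focal_length - sigma * f_number
    if denominator <= 0:
        # Outside the range where Equation (1) yields a finite positive distance.
        raise ValueError("blur too large for this lens configuration")
    return focal_length * image_distance / denominator


if __name__ == "__main__":
    # Hypothetical 50 mm lens at f/2 with the image plane 52 mm behind it.
    print(distance_from_blur(50.0, 52.0, 2.0, 0.0))  # surface of exact focus
    print(distance_from_blur(50.0, 52.0, 2.0, 0.5))  # a blurred point
```

With zero blur the expression reduces to the familiar thin-lens relation 1/D = 1/F - 1/v0, which gives the distance to the surface of exact focus.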


1.1 Images Acquired With A Standard Lens

Image data are determined by scene characteristics and the
properties of the lens system. For example, the rate of change
in image intensity at a point is dependent upon both the rate
of change in scene radiance and the focus of the lens system.
Therefore, if we are to measure the focus it seems that we must
already know the scene characteristics, a generally unrealistic
requirement.

There is, however, one special case in which we can know the
scene characteristics with a considerable degree of confidence,
i.e., the case when step discontinuities occur in the image
formation process. Such discontinuities give rise to image data that
are reliably recognizable, as will be explained below. Because
the scene parameters change very rapidly at a step discontinuity,
the rate of change we observe in the image is due primarily to
the point spread function. Thus, at sharp discontinuities we can
use the image data to determine the focus. These observations
lead to the following algorithm for recovering the absolute
distance D at points of discontinuity in an image acquired with a
conventional lens.

Step (1). Find points of step discontinuity in the image
formation process. We may identify such points by finding the
zero crossings of $C(x,y)$, the convolution of the Laplacian of a
Gaussian with the image intensity values $f(x,y)$ [1], i.e., the zero
crossings of

$$C(x,y) = \nabla^2 G(r,\sigma) \otimes f(x,y)$$

which have symmetric absolute values about the zero point and
which have nonzero change in image intensity across them. The
following example shows that such nontrivial, symmetric zero
crossings correspond closely to step discontinuities in the image
formation process.

Consider a step-discontinuity edge in the image of
magnitude $k$ in the x direction at position $(x_0, y_0)$, as defined by

$$S(x,y) = \begin{cases} k & \text{if } x > x_0; \\ 0 & \text{if } x < x_0. \end{cases}$$

In this case the convolution values $C(x,y)$ have the form

$$C(x,y) = \nabla^2 G(r,\sigma) \otimes S(x,y)
= \iint \nabla^2 G\left(\sqrt{(x-u)^2 + (y-v)^2},\, \sigma\right) S(u,v)\, du\, dv
= k\, \frac{dG(x - x_0, \sigma)}{dx}$$

where $G(x - x_0, \sigma)$ is a one-dimensional Gaussian centered at
point $x_0$, and the constant $\sigma$ is the spatial constant of the point
spread function at that point in the image. Thus, zero crossings
with this functional form and with $k > 0$ occur at points of step
discontinuity in the image formation process. It is unusual to
find such nontrivial, symmetric zero crossings at other points in
the image [2].


Step (2). Calculate the spatial constant of the point spread
function. The above example shows the form of the convolution
values across a step discontinuity for any given state of focus.
Thus, we can use this model of the image values to measure the
focus (i.e., estimate $\sigma$) at points of step discontinuity.

A maximum-likelihood estimate of $\sigma$ can be formed as
follows:

$$C(x,y) = k\, \frac{dG(x,\sigma)}{dx} = -\frac{k x}{\sqrt{2\pi}\,\sigma^3} \exp\left(-\frac{x^2}{2\sigma^2}\right) \qquad (3)$$


where $x$, $y$ and $k$ are as before, and for convenience $x_0$ is taken
to be zero. Taking the absolute value and then the natural log,
we find

$$\ln\frac{k}{\sqrt{2\pi}\,\sigma^3} - \frac{x^2}{2\sigma^2} = \ln\left|\frac{C(x,y)}{x}\right| \qquad (4)$$

We can formulate Equation (4) as

$$A x^2 + B = C \qquad (5)$$


Figure 1 Special Lens System Using Depth of Field to Estimate
Distance. The light from a single view is split into two identical
images, using a half-silvered mirror. The images are then directed
through two lens systems with different aperture size. The two
resulting images are identical except for aperture, and thus objects not on
the exact-focus surface have different amounts of defocus. Distance
can be computed by comparing the two point spread functions at each
point in the images.


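where

$$A = -\frac{1}{2\sigma^2}, \qquad B = \ln\frac{k}{\sqrt{2\pi}\,\sigma^3}, \qquad C = \ln\left|\frac{C(x,y)}{x}\right|.$$

If we interpret Equation (5) as a linear regression in $x^2$ we can
then obtain a maximum-likelihood estimate of the constants A
and B, and thus obtain a maximum-likelihood estimate of $\sigma$.
The solution of this linear regression is

$$A = \frac{\sum_i (x_i^2 - \bar{x})(C_i - \bar{C})}{\sum_i (x_i^2 - \bar{x})^2} \qquad (6)$$

where $\bar{x}$ is the mean of the $x_i^2$, and $\bar{C}$ is the mean of the $C_i$. From
A we can obtain the following maximum-likelihood estimate of
the value of the spatial constant $\sigma$:

$$\sigma = \sqrt{-\frac{1}{2A}}$$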

Step (3). Calculate the distance. Once we have obtained $\sigma$
for each symmetrical, nontrivial zero crossing point, we may
employ Equation (1) to find the distance to the imaged point. An
example of this algorithm applied to a natural image is shown
after the next section.
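
The regression in Step (2) can be illustrated on synthetic data. The sketch below assumes the convolution values across an edge follow the derivative-of-Gaussian form with the edge at x = 0, samples that form at a few offsets, and recovers the blur constant from the slope of ln|C/x| against x squared. All function names are hypothetical, and an ordinary least-squares line fit stands in for the maximum-likelihood estimate mentioned in the text.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def edge_response(x, k, sigma):
    """Model of the convolution value at offset x from a step edge of height k."""
    norm = math.sqrt(2.0 * math.pi) * sigma ** 3
    return -k * x / norm * math.exp(-x * x / (2.0 * sigma ** 2))


def estimate_sigma(offsets, values):
    """Regress ln|C/x| on x^2 and convert the slope A into sigma."""
    pairs = [(x, c) for x, c in zip(offsets, values) if x != 0 and c != 0]
    u = [x * x for x, _ in pairs]
    v = [math.log(abs(c / x)) for x, c in pairs]
    slope, _ = fit_line(u, v)
    if slope >= 0:
        raise ValueError("data do not look like a blurred step edge")
    return math.sqrt(-1.0 / (2.0 * slope))


if __name__ == "__main__":
    offsets = [-3, -2, -1, 1, 2, 3]
    values = [edge_response(x, k=5.0, sigma=2.0) for x in offsets]
    print(estimate_sigma(offsets, values))
```

On noise-free samples the fit is exact up to rounding; with real image data the samples near the zero crossing carry most of the signal, so how many offsets to include is a practical choice this sketch does not address.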


1.2 Images Acquired With A Special Lens System

The limiting factor in the previous method is the
requirement that we must know the scene characteristics before we can
measure the focus. This requirement restricts the applicability of
the method to certain special points, such as step discontinuities.
If, however, we had two images of exactly the same scene, but
with different depth of field, we could factor out the contribution
of the scene to the two images (as the contribution is the same),
and measure the focus directly.


The lens system shown in Figure 1 takes a single view of
the scene and produces two images that are identical except for
aperture size, and therefore depth of field. This lens system
uses a half-silvered mirror (or comparable contrivance) to split
the original image into two identical images, which are then
directed through lens systems with different aperture size. This
results in two images that are identical* except for aperture.**
The difference in aperture results in differing depth of field, and
thus imaged points will be focused differently in the two images.

Because the two images are identical except for aperture
size they may be compared directly; i.e., there is no matching
problem as there is with stereo or motion algorithms. We can
then recover the absolute distance D by comparing the focus in
the two images, as described in the following.

Step  (I).  Determine  ay,  the  point  spread  function  spatial  con¬ 
stant  for  the  first  image,  and  <r2,  the  point  spread  function  spatial 
constant  for  the  second  image. 

We  start  by  taking  a  patch  fi(r,t)  of  /|(*.  y),  the  first 
image,  centered  at  (zo.Po), 


we  see  that  we  may  use  Equation  (7)  to  derive  the  following 
relationship  7y  and  7t  (the  Fourier  transforms  of  image  patches 
/i  and  ft)  and  7a  (the  transform  of  fa)' 


Ji(X,f) 

7t(Kt) 


y/i*Oy 

y/irat 


(8) 


Thus* 


y,(X)  C(X,w,)<r2 

7j(X)  G(X,wj)oi 


^exp^JwVS-w*)) 


<•) 


where 

/(X)-jT  7{\,t\U 

Thus,  given  7\  and  7%  we  ran  find  a ,  and  at,  as  follows.  Taking 
the  natural  log  of  Equation  (9)  we  obtain 


/i(r,  #)  —  /i(jo  +  r  cos  I,  go  +  r  sin  #) 

and  calculate  its  two-dimensional  Fourier  transform  /|(X,I). 
The  same  is  done  for  a  patch  /2(r,  I)  at  the  corresponding  point 
in  the  second  image,  giving  us  %(X,#).  Note  that  there  is  no 
matching  problem,  as  the  images  are  identical  except  for  depth 
of  field. 

Now  consider  the  relation  of  f\  to  ft-  Both  cover  the 
same  region  in  the  image,  so  that  if  there  were  no  blurting  both 
would  be  equal  to  the  same  intensity  function  /o(r,f).  However, 
because  there  is  blurring  (with  spatial  constants  ay  and  w2),  we 

— /o(r,#)®G(r,Wi)  ... 

/a(r,#)-A(r,#)®G(r,o2) 

One  point  of  caution  here  is  that  Equation  (7)  may  be  sub¬ 
stantially  in  error  in  canes  with  a  large  amount  of  defocua,  as 
points  neighboring  the  patchs  fy,  ft  will  be  ‘spread  out*  into 
the  patches  by  differing  amounts.  This  problem  can  be  avoided 
by  using  patches  whose  edges  trail  off  smoothly,  e.g., 


2 

In  +  X*2r*(o|  -  <r*)  —  In  7i(X)  -  In  72(X) 

*i 

We  may  formulate  this  as 

XX*  +  B  —  C 


wheie 

A  -  -  <T|)  B-ln^|  C-ln7i(Xj-U>3t(X) 

i.e.,  as  a  linear  regression  equation  in  X1.  The  solution  to  this 
regression  equation  is  the  same  as  shown  in  the  last  example, 
and  gives  us  maximum-likelihood  estimates  of  A  and  B.  Solving 
A  and  B  for  Oy  and  e2  yields 

“  / 2*2(<®^1)  ^“^(Is  -lj  (10) 


ft  (r.  *)  —  /(*o  +  r  cos  #,  m  +  r  sin  #)G(r, «) 

for  appropriate  spatial  parameter  w.  Noting  that 

f{r,  *)-€-"*  F(X,  #)-«-'»• 

are  a  Fourier  pair  and  that  if  f\r,l)  and  f(X,<)  arc  a  Fourier 
pair  then  so  are 

/(or,#)  ^.D  . 

*There will be an overall brightness difference; this can be
removed by multiplying the intensity measurements from one of
the images by the inverse of the ratio of the mean brightnesses.
**Alternatively, one might change the focal length instead
of the aperture size. The technique, mathematics, and result
remain the same.


Step (5). Use the estimates of $\sigma_1$ and $\sigma_2$ from Equation (10) to
calculate absolute distance to the imaged surface patch. Using
Equation (1) for each of the two images, we see that we now have

$$D = \frac{F v_0}{v_0 - F - \sigma_1 f_1} = \frac{F v_0}{v_0 - F - \sigma_2 f_2}$$

where $f_1$ and $f_2$ are the f-numbers for the two halves of the
imaging system. We may solve either of these two equations for
D, the distance to the imaged surface patch. The solution is
overconstrained; both solutions must produce the same estimate
of distance, otherwise the estimates of $\sigma_1$ and $\sigma_2$ must be in
error. This can occur, for instance, when there is insufficient
high-frequency information in the image patch to enable the
change in focus to be calculated.
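
The internal consistency check described in this step can be sketched as follows. The sketch assumes the form of Equation (1) used in this paper, $D = F v_0 / (v_0 - F - \sigma f)$, with all lengths in the same unit; the function names and the relative tolerance `rel_tol` are hypothetical choices, not values from the paper.

```python
def distance_from_blur(focal_length, v0, sigma, f_number):
    """Distance D = F*v0 / (v0 - F - sigma*f) for one image of the pair."""
    denom = v0 - focal_length - sigma * f_number
    if denom <= 0:
        raise ValueError("blur estimate is too large for this lens setting")
    return focal_length * v0 / denom


def consistent_distance(focal_length, v0, sigma1, f1, sigma2, f2, rel_tol=0.05):
    """Return the mean of the two distance estimates, or None if they disagree.

    Disagreement signals that the estimates of sigma1 and sigma2 are in
    error, e.g. because the patch lacks high-frequency content.
    """
    d1 = distance_from_blur(focal_length, v0, sigma1, f1)
    d2 = distance_from_blur(focal_length, v0, sigma2, f2)
    if abs(d1 - d2) > rel_tol * max(d1, d2):
        return None
    return (d1 + d2) / 2
```

Returning `None` rather than an averaged value keeps unreliable patches out of the depth map, which is one way to use the overconstraint; other policies (for example, weighting by fit residual) are equally possible.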

*Note that we need only consider the amplitude of the transforms
in these calculations.

Figure 2 An Indoor Image of a Sand Castle, Refrigerator, and Door.


shows all the symmetric, nontrivial zero crossings. Parts (b), (c),
and (d) show points on these zero crossings that have large, medium
and small depth values, respectively. Close examination of these
figures will show that the depth-from-focus algorithm is indeed working
properly, although there is variability in the estimates.

2.  Evaluation 

The first method of deriving depth from focus, which uses
a standard lens system, was implemented in a straightforward
manner, and evaluated on the image shown in Figure 2. The
second algorithm has not yet been evaluated, because an appropriate
image pair has not yet been acquired.

Figure 3 shows the depth estimates which were obtained
when the standard-lens algorithm was applied to the image of
Figure 2. Part (a) of Figure 3 shows all the symmetric,
nontrivial zero crossings, i.e., identified points of step discontinuity.
Parts (b), (c), and (d) show points on these zero crossings
that have large, medium, and small depth values, respectively.
Close examination of these figures will show that the depth-from-focus
algorithm is indeed working properly, although there is
variability in the estimates. This variability may have resulted
from the substantial noise which was present in the digitized
image values, or from using too simple an approximation for
the point spread function (this is discussed in the appendix).

Figure 4 Depth Estimates for Zero-Crossing Segments. Part (a) of
this figure shows all the symmetric, nontrivial zero-crossing segments.
Parts (b), (c), and (d) show the zero-crossing segments that have large,
medium and small depth values, respectively. It can be seen that the
image is properly segmented with respect to depth, with the exception
of one small segment near the top of (c). This mistake could be
remedied by a segmentation procedure which is tolerant of positional
quantization errors in the location of the zero crossings.

One method of minimizing this variability is to average the
depth values along zero-crossing segments. Such averaging, of
course, makes the implicit assumption that depth values along
the contour vary smoothly. To achieve useful averaging, therefore,
zero-crossing contours were segmented at points of high local
curvature [2], and the average depth was computed for each
segment. The results of this procedure are shown in Figure 4.
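
The averaging procedure can be sketched as below. The curvature measure of reference [2] is not reproduced here; as a stand-in, this sketch breaks a contour wherever the turning angle between successive contour steps exceeds a threshold. The threshold `max_turn` and both function names are hypothetical.

```python
import math


def turning_angle(p, q, r):
    """Absolute change of direction (radians) at q along the path p -> q -> r."""
    a1 = math.atan2(q[1] - p[1], q[0] - p[0])
    a2 = math.atan2(r[1] - q[1], r[0] - q[0])
    d = a2 - a1
    return abs(math.atan2(math.sin(d), math.cos(d)))  # wrap to [0, pi]


def segment_mean_depths(points, depths, max_turn=math.pi / 4):
    """Split a contour at sharp turns and return the mean depth of each segment.

    `points` is an ordered list of (x, y) contour positions and `depths`
    the per-point depth estimates. A corner point closes the segment it
    belongs to; the next segment starts at the following point.
    """
    means, current = [], []
    last = len(points) - 1
    for i, depth in enumerate(depths):
        current.append(depth)
        is_corner = (
            0 < i < last
            and turning_angle(points[i - 1], points[i], points[i + 1]) > max_turn
        )
        if is_corner:
            means.append(sum(current) / len(current))
            current = []
    if current:
        means.append(sum(current) / len(current))
    return means
```

For an L-shaped contour whose two arms lie at different depths, the function returns one mean per arm, so noisy per-point estimates on each arm are smoothed without mixing the two depths.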

Part (a) of Figure 4 shows all the symmetric, nontrivial zero-crossing
segments, i.e., contour segments identified as discontinuous.
Parts (b), (c), and (d) show the zero-crossing segments
with large, medium and small depth values. It can be seen that
the image is properly segmented with respect to depth, with the
exception of one small segment near the top of (c). This mistake
could be remedied by a segmentation of the zero crossings which
makes allowances for positional quantization error.

3. Discussion

The most striking property possessed by these two algorithms
is that absolute depth can be recovered from a single view
with no image-to-image matching, the major problem in stereo
and motion algorithms. Furthermore, no special scene characteristics
need be assumed, so that the techniques are generally
applicable.

The normal-lens algorithm appears to have a potential for
accurate depth estimates that is comparable to edge- or feature-based
stereo and motion algorithms (e.g., [3], [4] or [5]). Each of
these algorithms is able to recover scene depth at certain feature
points given a camera model. The major difficulty in using
stereo and motion algorithms is identification of the same point
in successive views, whereas the major difficulty in measuring
focus at discontinuities is identification of the discontinuities;
once this has been accomplished, the rest is straightforward. We
have suggested one technique for identifying discontinuities, but
the possibility of better methods is not excluded.

The special-lens algorithm provides considerably stronger
information about the scene because it overconstrains scene
depth, allowing an internal check on the algorithm's answer.
Thus, the special-lens algorithm provides information comparable
to the best that is theoretically available from three-or-more-image
stereo and motion algorithms. The major limitation
of the special-lens algorithm appears to be that it requires
sufficient high-frequency information to measure the change in
focus. This is roughly similar to the requirement of stereo and
motion algorithms that there be distinguishable image features.

One question concerning the use of focus information is
whether such information is sufficiently free of noise to be useful.
Research in estimating the point spread function of an unfamiliar
image has shown that both the amplitude and phase components
of the Fourier transform are sufficiently stable to allow useful
estimation of the point spread function in the presence of normal
image noise [6, 7]. Thus, it appears that the issue of noise is not
likely to be an insurmountable hindrance.

Perhaps the major issue in the practical application of focus-measuring
algorithms is resolution. The normal-lens algorithm
can be applied to virtually any image; however, it requires that
the digitization be fine enough to adequately resolve the point
spread function. For a 35-mm slide taken in the normal manner,
this may mean digitizing with the 12 micron resolution available
on better-quality digitizers. The resulting image, of course, will
have somewhat more pixels than is currently the norm. This
plethora of pixels can be averted by using a combination of focal
length and f-number that results in a relatively smaller depth of
field than is typically employed by most photographers, as was
done in the example shown here. It is worth noting that the
human eye typically has an f-number of approximately 4, and
thus has exactly such limited depth of field.

Resolution is also an issue for the special-lens algorithm, as
the resolution of the resulting depth map is the original image
resolution divided by the size of the region used to calculate the
Fourier transform. It appears that a region of perhaps 10 × 10
pixels should be used in obtaining the transform, and thus the
resolution of the depth map will be 1/10th the resolution of the
original image.
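
The resolution figures in the two preceding paragraphs can be made concrete with a little arithmetic. The sketch below assumes the standard 36 mm × 24 mm frame of 35-mm still film (a figure not stated in the text) and the 10 × 10 pixel transform region suggested above; the function names are hypothetical.

```python
def pixels_for_frame(width_mm, height_mm, pitch_um):
    """Number of samples across and down a frame digitized at `pitch_um` microns."""
    return round(width_mm * 1000 / pitch_um), round(height_mm * 1000 / pitch_um)


def depth_map_shape(image_shape, patch=10):
    """Depth-map size when one depth value is produced per patch x patch region."""
    across, down = image_shape
    return across // patch, down // patch


# A 36 mm x 24 mm frame at 12 micron pitch, then one depth value per 10 x 10 patch.
frame = pixels_for_frame(36, 24, 12)
depth = depth_map_shape(frame)
```

Under these assumptions the digitized frame is 3000 × 2000 pixels and the special-lens depth map is 300 × 200, which illustrates both the "plethora of pixels" and the factor-of-ten loss in depth-map resolution.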


Human Perception. It is interesting to note that the human
visual system measures the information needed by the special-lens
system in at least two ways. First, the depth of field for
the red-green retinal cells is different from that for the blue
retinal cells, because of one diopter of chromatic aberration in
the lens. This provides two simultaneous views of the scene
with dissimilar depth of field, albeit in different spectral bands.
Second, the focal length of the human eye is constantly varying
in a sinusoidal fashion at a frequency of about 2 Hz [8]. The range
of variation depends upon the average accommodation* [9], but
can be almost one diopter** under normal conditions. Thus, two
views of the same scene with different depth of field are obtained
within two hundred and fifty milliseconds, approximately the
duration of a typical fixation.

An experiment demonstrating the importance of depth of
field in human perception can be easily performed by the reader.
Make a pinhole camera by poking a hole through a piece of paper
with a ball-point pen. Imposition of a pinhole in the line of
sight causes the depth of field to be very large, thus effectively
removing this depth cue from the image. Close one eye and
view the world through the pinhole, noting your impression of
depth. Now quickly remove the pinhole and view the world
normally (still using only one eye). The change in the sense of
depth caused by adding or removing the depth-of-field distance
cue is remarkable; observers report that the change is fully comparable
to the change caused by going from monocular viewing
to binocular viewing, or the change which occurs when a stationary
object begins to move.

The effect of the pinhole is not due to change in the field
of view, as can be demonstrated by comparing the percept obtained
through the pinhole to the percept obtained through a
viewing tube which occludes a similar portion of the scene. The
effect is also not due to removal of the accommodation depth cue
because, although it is true that pinhole viewing makes the eye's
accommodative state irrelevant, accommodation can provide an
estimate of depth at only a single point [10]. Thus removing
the accommodation cue can only change one's impression of the
average distance or of distance to the central point, not of the
depth relations throughout the scene.


*Accommodation refers to the focal length of the eye's lens.
**Diopters are the reciprocal of focal length (i.e., D = 1/F).
One diopter is approximately the strength of one's first pair of
glasses.

Figure 5 Geometry of Imaging. $v_0$ is the distance between the image
plane and the lens, $u_0$ is the distance between the lens and the locus
of perfect focus, and r is the radius of the lens. When a point at
distance $u > u_0$ is projected through the lens, it focuses at a distance
$v < v_0$, so that a blur circle is formed.


4. Appendix

For a thin lens,

$$\frac{1}{u} + \frac{1}{v} = \frac{1}{F} \qquad (11)$$

where u is the distance between a point in the scene and the
lens, v the distance between the lens and the plane on which
the image is in perfect focus, and F the focal length of the lens.
Thus,

$$u = \frac{Fv}{v - F} \qquad (12)$$

For a particular lens, F is a constant. If we then fix the
distance v between the lens and the image plane to the value
$v = v_0$, we have also determined a locus of points at distance
$u = u_0$ that will be in perfect focus, i.e.,

$$u_0 = \frac{Fv_0}{v_0 - F} \qquad (13)$$

We may now explore what happens when a point at a distance
$u > u_0$ is imaged. Figure 5 shows the situation in which a
lens of radius r is used to project a point at distance u onto
an image plane at distance $v_0$ behind the lens. Given this
configuration, the point would be focused at distance v behind
the lens, but in front of the image plane. Thus, a blur circle is
formed on the image plane. Note that a point at distance $u < u_0$
also forms a blur circle; throughout this paper we assume
that the lens system is focused on the nearest point so that u
is always greater than $u_0$. This restriction is not necessary in
the second algorithm, as overconstraint on the distance solution
allows determination of whether $D = u > u_0$ or $D = u < u_0$.

From the geometry of Figure 5 we see that

$$\tan\theta = \frac{r}{v} = \frac{\sigma}{v_0 - v} \qquad (14)$$

Combining Equations (12) and (14) and substituting the distance
D for the variable u we obtain

$$D = \frac{F r v_0}{r v_0 - F(r + \sigma)}$$

or

$$D = \frac{F v_0}{v_0 - F - \sigma f}$$

where f is the f-number of the lens.

The blurring of the image is better described by the point
spread function than by a blur circle, although the blurring is
bounded by the blur circle radius in the sense that the point
spread function is less than some threshold outside of the blur
circle. The point spread function is due primarily to diffraction
effects, which for any particular wavelength produce wave cancellation
and reinforcement resulting in intensity patterns qualitatively
similar to the sinc function, but with different
amplitudes and periods for the "rings" around the central peak.

The point spread function describes the image intensity
f(p, p) caused by a single coherent point-source light in terms
of the parameters of the lens system. It is described [11] by



where 


iw/*,*')-  £(-i 

»—o 

)*(*-« o) 


where $\lambda$ is the wavelength of the light, r is the distance from the
center of the point spread function, $J_n(v)$ is the Bessel function
of the first kind and order n, and $v_0$, v, $\sigma$ and F are as before.

The "rings" produced by this function vary in amplitude,
width and position with different states of focus and with
different wavelengths. As wavelength varies these rings change
position by as much as 90°, so that the blue light troughs become
positioned over the red light peaks, etc. Further, change
in wavelength results in substantial changes in the amplitude of
the various rings. Although this point spread function is quite
complex, and the sum over different wavelengths even more so,
it appears that the envelope for the various functions has the
general shape of a two-dimensional Gaussian.

Sampling effects caused by digitization are typically next in
importance after the diffraction effects. The effect of sampling
may be accounted for in the point spread function by convolving
the above diffraction-produced point spread function with
functions describing the sampling process. Other factors such as
chromatic aberration, movement, and diffusion of photographic
emulsion may also be accounted for in the final point spread
function by additional convolutions.

The net effect, in light of the central limit theorem, is probably
best described by a two-dimensional Gaussian $G(r,\sigma)$ with
some spatial constant $\sigma$, although the question is still being
investigated.* The spatial constant $\sigma$ of the point spread function
will be proportional to the radius of the blur circle; however
the constant of proportionality will depend on the particulars of
the optics, sampling, etc. In this paper the radius of the blur
circle and the spatial constant of the point spread function have
been treated as identical; in practical application where recovery
of absolute distance is desired the constant of proportionality k
must be determined for the system and included in Equation (1)
as follows:

$$D = \frac{F v_0}{v_0 - F - \sigma k f}$$

*Even if effects other than diffraction are very small, the resulting
point spread function can be quite well modeled by the
difference of three Gaussians. Using this model of the point
spread function makes the computations of Equations (7) -
(10) somewhat more difficult, but the nature of the result is
unaffected.

Abstract

This report describes conceptual work on a system
that reasons between structure and function in
the domain of physical objects interacting through
mechanical contact. Structure is represented by a
hierarchy consisting of generalized cones, abstract
shape descriptions, and motion sequences of the
spines of generalized cones. Function is represented
by a hierarchy of kinematic primitives, functional
primitives, and causal networks. Both qualitative
and quantitative reasoning are used to determine
the relationship between structure and function.


Overview 

The objective of high level vision research has
been to recover the 3-D structure of a scene from
image data. In order for an intelligent entity to
carry out tasks in the real world, the perceived 3-D
structure of a scene is transformed into objects
identified by functional category. Comparison to a
set of structural prototypes is one method to do this
transformation. However, it is inherently weak when
confronted with objects differing from the stored
prototypes, since there is no good object-independent
similarity metric. It is also unable to use objects
adaptively, e.g., to see that a dime can be used
as a screwdriver.

A system that understands and reasons between
structure and function would have many applications.
This ability is a necessary component of
an intelligent robot able to carry out tasks in
unconstrained environments. The system could serve
as an intelligent assistant for mechanical designers.
This system could also solve the crucial difficulty
in the parts inspection problem of classifying structural
anomalies as cosmetic flaws or functional impairments.
Since humans understand the world in
functional terms, it would facilitate human interface.

The problem of reasoning between structure
and function is generic across domains. AI researchers
are developing automatic programming
systems which take functional specifications and
produce computer programs, while others have
focused on digital systems, analog circuits, and organic
chemistry. Previous work in the domain
of objects interacting through mechanical contact
has either used geometric domains simplified to
1 or 2 dimensions with a limited vocabulary of
physical interaction [Forbus 81a] [deKleer 75], or
has focused on the causal interaction in complex
mechanisms [Rieger 78]. This report focuses on the
relation between 3-D geometry and function.

The first section of the paper discusses the representation
of function for 3-D objects interacting
through mechanical contact. Although it is not exhaustive,
it shows how functional properties can
be built up out of physical properties associated
with structural properties. The second section discusses
generalized cones as an appropriate representation
of 3-dimensional objects for determining
kinematic properties. The third section discusses
representations and methods for reasoning between
structure and function, including some preliminary
work on the use of analogy.

Functional  Representation 

The function of an object is represented as
a hierarchical description with symbolic causal
relationships at the top level and kinematic primitives
at the bottom level. At an intermediate level
are functional primitives, which are used as nodes
in the causal relationships and are instantiated by
the kinematic primitives. The following are some
examples of functional primitives, represented as
triplets of the form (function agent object):

(grasp  hand  screwdriver) 

(support  table  lamp) 

(contain  glass  water) 

(support  chair  human  being) 

(cut  knife  meat) 

(contain  closet  clothes) 

(brace  strut  chair-leg) 
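
One way to hold such triplets in a program is as small immutable records. The sketch below is illustrative only: the class name `FunctionalPrimitive`, the field name `obj` and the helper `functions_of` are assumptions of this example and are not taken from the system described in the report.

```python
from typing import NamedTuple


class FunctionalPrimitive(NamedTuple):
    """A (function agent object) triplet."""
    function: str
    agent: str
    obj: str


PRIMITIVES = [
    FunctionalPrimitive("grasp", "hand", "screwdriver"),
    FunctionalPrimitive("support", "table", "lamp"),
    FunctionalPrimitive("contain", "glass", "water"),
    FunctionalPrimitive("support", "chair", "human being"),
    FunctionalPrimitive("cut", "knife", "meat"),
    FunctionalPrimitive("contain", "closet", "clothes"),
    FunctionalPrimitive("brace", "strut", "chair-leg"),
]


def functions_of(agent, primitives=PRIMITIVES):
    """Sorted list of the functions in which `agent` appears as the agent."""
    return sorted({p.function for p in primitives if p.agent == agent})


def agents_for(function, primitives=PRIMITIVES):
    """Sorted list of agents known to perform `function`."""
    return sorted({p.agent for p in primitives if p.function == function})
```

Indexing by function as well as by agent reflects the adaptive use of objects mentioned in the Overview: asking which agents can `support` something returns both the table and the chair.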

The function of an object is the set of causal relationships
in which the object is a causal agent. Causality
is treated as a simplifying mental construct for the
purposes of description and reasoning rather than
a metaphysical property. Thus it is possible for
events linked by a causal relationship to be reversed
in the normal time sequence [deKleer 79]. The following
is an example of a causal network of a set of
functional primitive triplets:

(grasp  robot-hand  screwdriver)  and  (rotate 
robot-hand)  cause  (rotate  screwdriver) 

(mated  screwdriver-tip  screw-slot)  and  (rotate 
screwdriver)  cause  (rotate  screw) 
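
The two causal statements above can be read as rules of the form "these facts together cause that fact". A minimal sketch of following such a network by forward chaining is given below; the rule encoding and the function name `forward_chain` are assumptions of this example, and the report does not say how its causal networks are actually evaluated.

```python
# Each rule is (set of causing facts, caused fact); facts are plain tuples.
RULES = [
    (
        {("grasp", "robot-hand", "screwdriver"), ("rotate", "robot-hand")},
        ("rotate", "screwdriver"),
    ),
    (
        {("mated", "screwdriver-tip", "screw-slot"), ("rotate", "screwdriver")},
        ("rotate", "screw"),
    ),
]


def forward_chain(facts, rules=RULES):
    """Return all facts derivable from `facts` by repeatedly applying `rules`."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for causes, effect in rules:
            # A rule fires when all of its causes are known and its effect is new.
            if causes <= known and effect not in known:
                known.add(effect)
                changed = True
    return known
```

Starting from the grasp, the hand rotation and the mated tip, the chain derives first the rotation of the screwdriver and then the rotation of the screw; removing the `mated` fact stops the chain after the first step.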

The functional primitives are instantiated as
kinematic primitives. The kinematic primitives are
motions, constraints on motions, and the relationship
of these two to various types of forces. For example,
the functional primitive (support table lamp)
is instantiated as a constraint on motion in the
direction of gravity imposed by the table on the
lamp. The force associated with this constraint is
that necessary to either bend the table or to break
the table. Also associated with the interaction between
the table and the lamp are frictional forces
that constrain the motion of the lamp in the plane
of the table's surface. These forces could be represented
as numerical quantities, algebraic expressions,
or more abstractly as a partial order. For
example, the bending force is less than the breaking
force, while the rolling coefficient of friction is less
than the sliding coefficient of friction. Ken Forbus'
work on quantity space shows how a partial order of
quantities is useful for qualitative reasoning [Forbus
81b]. The description of motion and forces requires
a co-ordinate system for the associated vectors. The
next section discusses how the choice of co-ordinate
systems constrains the choice of a representation
for structure.
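
The partial-order representation mentioned above can be sketched with a set of "less than" pairs and a transitive lookup. This is only an illustration in the spirit of a quantity space, not Forbus' formulation; the quantity names and the functions `is_less` and `compare` are hypothetical.

```python
# Stated orderings from the text: (smaller quantity, larger quantity).
LESS_THAN = {
    ("bending-force", "breaking-force"),
    ("rolling-friction", "sliding-friction"),
}


def is_less(a, b, less_than=LESS_THAN):
    """True if a < b follows from the stated pairs by transitivity."""
    stack, seen = [a], {a}
    while stack:
        current = stack.pop()
        for low, high in less_than:
            if low == current and high not in seen:
                if high == b:
                    return True
                seen.add(high)
                stack.append(high)
    return False


def compare(a, b, less_than=LESS_THAN):
    """Return '<', '>' or None when the partial order leaves the pair unordered."""
    if is_less(a, b, less_than):
        return "<"
    if is_less(b, a, less_than):
        return ">"
    return None
```

The `None` result is the point of using a partial order: a bending force and a friction coefficient are simply not comparable, and a qualitative reasoner can say so without inventing numbers.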


Structural  Representation 

For objects interacting through mechanical
contact, the key components of the structure are
the volume, surfaces, edges, and points and the
mechanical forces associated with these geometric
properties. Objects occupy a volume, and some
amount of force is needed to deform this volume
or fragment the object. Surfaces have static, sliding,
and rolling coefficients of friction. A sharp edge
can cut, while a point can pierce. The co-ordinate
system for motions and forces should be centered
in the objects. For example, consider a description
of a man walking, first in the co-ordinate systems
of the body and limbs of the man and then in
the co-ordinate system of a camera. The description
of motion is simple and concise in the first set
of co-ordinate systems, but in the camera-centered
co-ordinate system the motions of the legs, arms,
and hands are very complex. [Marr 76] argues for
an object-centered co-ordinate system with one axis
along the direction of elongation. The choice is dictated
by the domain; for example, in the domain of
free-falling objects in a gravitational field a global
co-ordinate system oriented along the direction of
gravity would facilitate the description of motion.
Generalized cones [Binford 71] are a concise representation
of 3-D objects with an object-centered
co-ordinate system oriented along an axis of elongation.
[Nevatia 74] describes a program that builds a
generalized cone description of a scene from a laser-rangefinder
depth map.

As implemented in the Acronym system,
generalized cones provide a simple constructive representation
of shape [Brooks 81]. Volumes are 2-D
cross-sections swept along a spine, while surfaces
are either the end-faces or the perimeter of
the cross-section swept along the spine. Similar
properties hold for edges and points. This description
facilitates the transformation of surface, edge,
and point co-ordinate systems into the co-ordinate
system of the body of the object. Thus the forces
arising between surfaces of contacting objects expressed
in surface co-ordinate systems can be easily
transformed into motions and constraints on motion
of the objects themselves. The constructive
analytic representation of generalized cones
in Acronym also facilitates the computation and
reasoning about mechanical properties such as
weight, surface area, and frontal cross-section for
computing drag.

While generalized cones and the mechanical
properties of objects provide a good representation
for reasoning between structure and kinematic
primitives, more abstract representations of structure
facilitate reasoning between structure and
functional primitives. The distinction between accidental
and criterial structural features is often
brought out through more abstract descriptions.
For example, structural symmetry is a good indication
of functional symmetry. Grouping of objects
is often associated with a particular function; for
example, grouped objects that are physically cross-linked
are often some type of support structure.
Protrusions and intrusions are used when two objects
are mated together for some purpose, such as
the tip of a screwdriver mating the slot in the head
of a screw. Negative volumes open at one end suggest
containment of solids or liquids. [Hollerbach
75] discusses computing descriptions of intrusions
and protrusions from a sub-class of generalized
cones.
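
To illustrate how a constructive swept-volume description yields mechanical properties such as weight, surface area, and frontal cross-section directly, the sketch below treats the simplest case: a circular cross-section swept along a straight spine with linearly varying radius. These restrictions, the parameter `weight_density` and the function name are assumptions of this example; it is not Acronym's representation.

```python
import math


def straight_cone_properties(length, r_start, r_end, weight_density=1.0):
    """Mechanical properties of a circular cross-section swept along a straight spine.

    The radius varies linearly from r_start to r_end over `length`.
    `weight_density` is weight per unit volume (hypothetical parameter).
    """
    volume = math.pi * length * (r_start ** 2 + r_start * r_end + r_end ** 2) / 3
    slant = math.hypot(length, r_end - r_start)            # length of the swept perimeter
    swept_surface = math.pi * (r_start + r_end) * slant    # perimeter swept along the spine
    end_faces = math.pi * (r_start ** 2 + r_end ** 2)      # the two end-face surfaces
    return {
        "volume": volume,
        "weight": weight_density * volume,
        "surface_area": swept_surface + end_faces,
        # Area presented to motion along the spine, as used for drag.
        "frontal_cross_section": math.pi * max(r_start, r_end) ** 2,
    }
```

For a cylinder of length 2 and radius 1 this gives a volume of 2π, a surface area of 6π and a frontal cross-section of π, each obtained from the sweep parameters alone.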

At the most abstract level, structure could be
described topologically or through the spines of
generalized cones. Topological representations of
structure, such as connected, adjacent, and contain,
are rather weak except for domains which are inherently
topological, such as graphs. An example
would be a network of roads. The spines of generalized
cones correspond to stick figure diagrams,
which lead to the strong impression of causal interactions
when their motion through time is simulated.
[Soroka 80] describes using a simulation of the
time sequence of positions of the spines of a robot
arm to debug robot assembly programs. [Marr 80]
discusses using a state-motion-state description to
segment actions based on information in the relative
positions of the spines of generalized cones through
time.

Relationship  between  Structure  and  Function 

The relationship between structure and function
depends on the level of description. At the
level of kinematic primitives and generalized cones
the relationship can be described using parametric
constraints on shapes and mechanical properties.
For example, a lamp will stay in one fixed spot on
the plane of a table as long as lateral forces do
not exceed the static frictional forces between the
surface of the table and the bottom of the lamp.
A robot hand rigidly grasps a screwdriver if the
forces applied by the hand constraining the six degrees
of freedom of the screwdriver exceed any other
forces applied. An object provides a comfortable
sitting support for a human being if its dimensions
match those determined by ranges of human physical
dimensions. These constraints are generally
inequalities over variables and terms composed of
sums, products, divisions, and trigonometric functions
of terms. Acronym currently has a sub-system
that represents and manipulates these types of constraints.
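
Two of the examples above can be written directly as inequality checks. The sketch below is a hand-written illustration, not Acronym's constraint sub-system; the function names are hypothetical, and the seat dimension ranges are made-up illustrative numbers rather than anthropometric data.

```python
def stays_in_place(lateral_force, normal_force, mu_static):
    """True if static friction can resist the lateral force: |F_lat| <= mu_s * N."""
    return abs(lateral_force) <= mu_static * normal_force


def satisfies_ranges(dimensions, ranges):
    """True if every named dimension lies inside its (low, high) range."""
    return all(low <= dimensions[name] <= high for name, (low, high) in ranges.items())


# Illustrative ranges in metres (assumed values, for demonstration only).
SEAT_RANGES = {
    "seat_height": (0.38, 0.50),
    "seat_depth": (0.35, 0.50),
    "seat_width": (0.40, 0.60),
}
```

A lamp pressed on the table with a normal force of 10 units and a static coefficient of 0.3 resists lateral forces up to 3 units; an object qualifies as a sitting support under this sketch only when all three of its dimensions fall inside the stated ranges.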


[Figure: Hierarchy of functional and structural descriptions with
methods of reasoning between them. Legible label: "Topological
relationships".]

At the level of functional primitives and
abstract shape descriptions the relationship can be
described using enabling links and inhibiting links
in the causal network. For example:

(open-concavity glass) ENABLES (contain
glass liquid)

(protrusion tip-of-screwdriver) and (intrusion
slot-of-screwhead) and (same-cross-section tip-of-screwdriver
slot-of-screwhead) ENABLES (mated
tip-of-screwdriver slot-of-screwhead)

(NOT-ALIGNED screwdriver screw) INHIBITS
(mated tip-of-screwdriver slot-of-screwhead)
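
Enabling and inhibiting links of this kind can be evaluated with a simple rule: a functional primitive holds when some set of enabling conditions is fully present and no inhibiting condition is. The sketch below encodes the three links above under that reading; the data layout and the function name `is_enabled` are assumptions of this example.

```python
# For each goal: alternative sets of enabling facts, and sets of inhibiting facts.
LINKS = {
    ("contain", "glass", "liquid"): {
        "enables": [{("open-concavity", "glass")}],
        "inhibits": [],
    },
    ("mated", "tip-of-screwdriver", "slot-of-screwhead"): {
        "enables": [{
            ("protrusion", "tip-of-screwdriver"),
            ("intrusion", "slot-of-screwhead"),
            ("same-cross-section", "tip-of-screwdriver", "slot-of-screwhead"),
        }],
        "inhibits": [{("NOT-ALIGNED", "screwdriver", "screw")}],
    },
}


def is_enabled(goal, facts, links=LINKS):
    """True if `goal` is enabled by the set `facts` and not inhibited by it."""
    entry = links[goal]
    enabled = any(conditions <= facts for conditions in entry["enables"])
    inhibited = any(conditions <= facts for conditions in entry["inhibits"])
    return enabled and not inhibited
```

With the three shape facts present the tip and slot are judged mated; adding the `NOT-ALIGNED` fact blocks the same conclusion even though the shape conditions still hold.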

Work performed at MIT demonstrated that
analogy can be a useful tool for reasoning between
structure and function at this level of
description [Binford 82]. Winston's analogy program,
coupled with Katz's natural language parser,
was used to input natural language descriptions
of mechanical interactions such as a screwdriver
turning a screw. The shape properties that enabled
these interactions were also described. Using
analogy, the program decided that an Allen wrench
might plausibly be used to turn an Allen bolt based
on their abstract shape descriptions.

At the most abstract level of description, patterns
of movements between objects correspond
to causal networks. For example, the structural
description of the sequence of a robot hand moving
to an object, moving its fingertips so that they hold
the object, moving the hand to a new location, and
then moving its fingertips apart has a corresponding
functional description of grasping, moving, and
ungrasping an object. The overall network can be
represented as moving the object.

Summary 

This  report  described  conceptual  work  on  a 
system  that  reasons  between  structure  and  function 
for  3-D  objects  interacting  through  mechanical  con¬ 
tact.  The  hierarchical  description  of  structure  and 
function,  and  the  relationship  and  the  methods  used 
to  reason  between  them  are  shown  in  the  preceding 
figure. 

This work was supported by ARPA contract N00039-82-C-0250.

Abstract

Mackworth's gradient space has proven to be a useful tool for image understanding. However, descriptions of its important properties have been somewhat scattered in the literature.

This paper develops and summarizes the fundamental properties of the gradient space under orthography and perspective, and for curved surfaces. While largely a recounting of previously published results, there are a number of new observations, particularly concerning the gradient space and perspective projection. In addition, the definition and use of vector gradients as well as surface gradients provides concise notation for several results.




Preliminary  Definitions 

Coordinate System

In these equations, the coordinate system being used is Mackworth's [12]: the x and y axes in the scene are aligned with those in the image (x horizontal, y vertical), and the z axis points towards the viewer (i.e. a right-handed coordinate system) (figure 1). The eye (center of lens) is at the origin (0, 0, 0), and the image plane is z = -1 (i.e. the focal plane is z = 1, which is rotated around the origin to the image plane, z = -1, to preserve the sense of "up", "down", "left", and "right" from the scene).


The properties explored in the paper include the orthographic and perspective projections themselves; the definition of gradients; the gradient space consequences of vectors (edges) belonging to one or more surfaces, and of several vectors being contained on a single surface; and the relationships between vanishing points, vanishing lines, and the gradient space.

The paper is intended as a study guide for learning about the gradient space, as well as a reference for researchers working with gradient space.


Introduction 

The gradient space has proven a useful tool for image understanding. Since its proposal by Mackworth [12] based on Huffman's dual space [5], the gradient space has been used for defining consistency of line-labelings [6, 7], relating surface orientation to image intensity [3, 4, 18], and relating surface orientation to image geometry [8, 9, 10, 13, 18].

The descriptions of important gradient space properties, however, have been scattered throughout the literature. In this paper, the gradient space is defined and its fundamental properties are summarized. This presentation is especially useful because its assignment of gradients to vectors as well as surfaces allows concise statements of important properties.

This paper is primarily a summary and restatement of important gradient space properties, but also includes statements of some new properties. It is intended that the paper provide a reference for people working with the gradient space, as well as a study guide for researchers being introduced to the gradient space.




Orthography 

In  orthographic  projection,  the  scene  point  (x,  y,  z)  is 
mapped  onto  the  image  point  (x,  y).  Thus,  the  image  point 
(x,  y)  represents  the  set  of  scene  points  (x,  y,  z)  for  all  values 
of  z. 

Perspective 

In perspective projection, the scene point (x, y, z) is mapped onto the image point (-x/z, -y/z). The image point is the point at which a line through the origin (eye) and (x, y, z) intersects the image plane. The unit of measure in the coordinate system is the focal length of the camera lens. This is similar to Kender's coordinate system, but with the direction of the z-axis reversed [11]. An image point (x, y) corresponds to the set of scene points (ax, ay, -a) for all values of a.
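
The two projections can be written down directly from these definitions. The following is a minimal sketch; the function names `orthographic` and `perspective` are chosen here for illustration and do not come from the paper.

```python
def orthographic(point):
    """Orthographic projection: (x, y, z) -> (x, y)."""
    x, y, _z = point
    return (x, y)


def perspective(point):
    """Perspective projection onto the image plane z = -1: (x, y, z) -> (-x/z, -y/z)."""
    x, y, z = point
    if z == 0:
        raise ValueError("points with z = 0 have no image under perspective")
    return (-x / z, -y / z)


# Every scene point (a*x, a*y, -a) maps to the same image point (x, y).
image_point = perspective((2 * 0.5, 2 * -0.25, -2))
```

Here `image_point` equals `(0.5, -0.25)`, the image point shared by the whole ray of scene points (ax, ay, -a).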
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Gradient  Space  and  Orthography 

In  this  section  important  relationships  between  surfaces,  vectors, 
and  gradients  are  described.  In  addition,  several  important 
observations  concerning  orthographic  projection  are  noted. 


Corollaries  of  this  result  are: 

1. Parallel vectors have the same gradient. The gradient of a line can be defined as the gradient of any vector contained in the line.


1.  Definition  of  Surface  Gradient 

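Suppose a surface is defined as -z = f(x, y). Then its gradient (p, q) is defined by [12]:

\[ (p, q) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = \left( -\frac{\partial z}{\partial x}, -\frac{\partial z}{\partial y} \right) \]

The set of all gradients (p, q) is the gradient space.
Corollary to this result is: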


2.  The  gradient  of  a  surface  is  the  same  as  the  gradient  of 
its  surface  normal  vectors  (figure  2). 

3. Under orthography, the vector (Δx, Δy, Δz) is seen in the image as an edge E = (Δx, Δy). If the vector's gradient is G, then G = E / Δz. The line in gradient space from the origin through G is thus parallel to the edge in the image (figure 3).


1. In any direction u, the tangent vector to the surface is:

\[ \left( \frac{dx}{du}, \frac{dy}{du}, -p \frac{dx}{du} - q \frac{dy}{du} \right) \]

The tangent vectors in the directions u = x and u = y are thus (1, 0, -p) and (0, 1, -q) respectively. Their cross product, (p, q, 1), is therefore a surface normal.


2.  Gradient  of  a  Plane 

Suppose a plane is defined by Ax + By + Cz + D = 0. Then its gradient is:

\[ (p, q) = \left( \frac{A}{C}, \frac{B}{C} \right) \]

Corollary to this result is:

1. Since D has no effect on p and q, parallel planes have the same gradient. Each point (p, q) thus represents the gradient for a family of parallel planes.

3.  Gradient  of  a  Vector 

Suppose a vector is (Δx, Δy, Δz). Then its gradient can be defined as:

\[ (p, q) = \left( \frac{\Delta x}{\Delta z}, \frac{\Delta y}{\Delta z} \right) \]

(This is not Huffman's dual line [5]: the dual is the line described below in section 7.)

Although  the  term  gradient  technically  refers  to  a  property 
of  differentiable  surfaces,  it  is  used  here  for  vectors  because 
the  gradient  space  can  represent  3D  orientation  in  general, 
not  just  orientation  of  surfaces. 
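
The three definitions above (surface normal, plane gradient, vector gradient) can be sketched in a few lines. The function names are illustrative assumptions; the formulas are the ones stated in sections 1 to 3.

```python
def surface_normal(p, q):
    """A normal to a surface with gradient (p, q) is (p, q, 1)."""
    return (p, q, 1.0)


def plane_gradient(A, B, C):
    """Gradient (A/C, B/C) of the plane Ax + By + Cz + D = 0; D plays no role."""
    if C == 0:
        raise ValueError("plane contains the z direction; gradient is at infinity")
    return (A / C, B / C)


def vector_gradient(v):
    """Gradient (dx/dz, dy/dz) of the vector (dx, dy, dz)."""
    dx, dy, dz = v
    if dz == 0:
        raise ValueError("vector is parallel to the image plane; no finite gradient")
    return (dx / dz, dy / dz)


G = plane_gradient(2.0, 4.0, 2.0)             # plane 2x + 4y + 2z + D = 0
G_of_normal = vector_gradient(surface_normal(*G))
```

In this example `G` and `G_of_normal` are both `(1.0, 2.0)`, which illustrates that a surface and its normal vector share a gradient, and scaling a vector leaves its gradient unchanged.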



Figure  2:  Gradient  of  Surface  and  Surface  Normal 


Figure  3:  Gradient  of  a  Vector  Under  Orthography 

4. Vector Contained on a Surface

Suppose the vector (Δx, Δy, Δz) is contained on a surface whose gradient is (p, q).

Since the surface normal (p, q, 1) must be orthogonal to the vector,

\[ (p, q, 1) \cdot (\Delta x, \Delta y, \Delta z) = 0 \]
\[ p \Delta x + q \Delta y + \Delta z = 0 \]

Therefore,

\[ (p, q) \cdot (\Delta x, \Delta y) = -\Delta z \]

Corollary to this result is:

1. Under orthography, the vector (Δx, Δy, Δz) is seen in the image as the edge E = (Δx, Δy). If it is contained on a surface with gradient G, then

\[ G \cdot E = -\Delta z \]

This  is  one  of  the  most  important  relations  in 
orthography,  since  polyhedral  scenes  contain  surfaces 
bounded  by  many  edges. 

5.  Vector  Contained  on  Two  Surfaces 

Suppose the vector (Δx, Δy, Δz) is the boundary between two surfaces with gradients G1 = (p1, q1) and G2 = (p2, q2) (figure 4). Further, define E to be (Δx, Δy).

Then:

\[ -\Delta z = G_1 \cdot E = G_2 \cdot E \]
\[ 0 = (G_1 - G_2) \cdot E \]
\[ (G_1 - G_2) \perp E \]

G1 - G2 is the vector from G2 to G1 in the gradient space. Thus, the vector E is perpendicular to the line containing G1 and G2 in gradient space.

Figure 4: Vector Contained on Two Surfaces

As a corollary:

1. Under orthography, with the above definitions, the edge in the image is E. Therefore, the edge in the image is perpendicular to the line containing G1 and G2. (This is Mackworth's relation involving connect edges [12].)
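
As a numerical check of the connect edge relation, the sketch below builds the edge shared by two planes from their normals and confirms that its image is perpendicular to G1 - G2. The helper names (`cross`, `edge_between`, `dot2`) and the sample gradients are assumptions for illustration only.

```python
def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def edge_between(G1, G2):
    """Direction of the edge shared by two surfaces, from their normals (p, q, 1)."""
    n1 = (G1[0], G1[1], 1.0)
    n2 = (G2[0], G2[1], 1.0)
    return cross(n1, n2)


def dot2(a, b):
    """Dot product of two 2-D vectors."""
    return a[0] * b[0] + a[1] * b[1]


G1, G2 = (1.0, 2.0), (-3.0, 0.5)
dx, dy, dz = edge_between(G1, G2)            # scene vector on both surfaces
E = (dx, dy)                                 # its image under orthography
difference = (G1[0] - G2[0], G1[1] - G2[1])  # G1 - G2 in gradient space
```

For these values `dot2(difference, E)` is zero and `dot2(G1, E)` equals `-dz`, matching sections 4 and 5.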

6.  Two  Vectors  Contained  on  a  Surface 

Suppose the two vectors (Δx1, Δy1, Δz1) and (Δx2, Δy2, Δz2) both lie on a surface whose gradient is G = (p, q). Further, let E1 = (Δx1, Δy1) and E2 = (Δx2, Δy2) correspond to the x-y components of the two vectors (figure 5).


As  a  corollary: 

1. Under orthography, vectors E1 and E2 have the same coordinates in the image that they have in the scene. So, given the Δz values for two vectors on a surface, the gradient of the surface can be found using the image.

7.  Gradients  of  Perpendicular  Vectors  and  Planes 

Suppose two vectors (Δx1, Δy1, Δz1) and (Δx2, Δy2, Δz2) are perpendicular (in the scene), and that their gradients (as defined above) are G1 = (p1, q1) and G2 = (p2, q2) (figure 6). Then:


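Figure 5: Two Vectors Contained on a Surface

Then:

\[ (\Delta x_1, \Delta y_1, \Delta z_1) \cdot (\Delta x_2, \Delta y_2, \Delta z_2) = 0 \]
\[ \Delta x_1 \Delta x_2 + \Delta y_1 \Delta y_2 + \Delta z_1 \Delta z_2 = 0 \]

Dividing by Δz1 Δz2,

\[ p_1 p_2 + q_1 q_2 + 1 = 0 \]
\[ G_1 \cdot G_2 = -1 \]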

Suppose that G1 is given. Then, the above equation is a line L in gradient space, which is the locus of the possible locations of G2. In fact,

• L is perpendicular to the line from G1 to the origin

• the distance from L to the origin is the reciprocal of the distance from G1 to the origin

• L is on the opposite side of the origin from G1.
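
The three properties of the line L can be checked numerically. In this sketch, `dual_line_foot` is a hypothetical helper that returns the point of L closest to the origin, derived from the relation G1 · G2 = -1; the sample vectors are chosen only for illustration.

```python
import math


def dual_line_foot(G1):
    """Point of the line {G2 : G1 . G2 = -1} that is closest to the origin."""
    p, q = G1
    norm_sq = p * p + q * q
    if norm_sq == 0:
        raise ValueError("G1 at the origin has no finite dual line")
    # The foot lies along -G1, at distance 1/|G1| from the origin.
    return (-p / norm_sq, -q / norm_sq)


G1 = (3.0, 4.0)
foot = dual_line_foot(G1)
distance = math.hypot(*foot)

# A scene vector perpendicular to (3, 4, 1), and its gradient.
v2 = (1.0, 0.0, -3.0)
G2 = (v2[0] / v2[2], v2[1] / v2[2])
```

Here `distance` is 1/5 (the reciprocal of |G1| = 5), `foot` lies on the opposite side of the origin from G1, and `G2` satisfies G1 · G2 = -1.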


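\[ -\Delta z_1 = E_1 \cdot G \]
\[ -\Delta z_2 = E_2 \cdot G \]

These can be combined into a single matrix equation, in which the upper row is the first equation above, and the lower row is the second equation:

\[ \begin{pmatrix} -\Delta z_1 \\ -\Delta z_2 \end{pmatrix} = \begin{pmatrix} \Delta x_1 & \Delta y_1 \\ \Delta x_2 & \Delta y_2 \end{pmatrix} \begin{pmatrix} p \\ q \end{pmatrix} \]

Solving this system (possible when E1 and E2 are not parallel) expresses the surface gradient G as a function of the x, y, and z components of two vectors contained on the surface.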
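
A minimal sketch of solving for G from two vectors on the surface, using Cramer's rule on the pair of equations E1 · G = -Δz1 and E2 · G = -Δz2. The function name is an assumption made for this example.

```python
def surface_gradient_from_vectors(v1, v2):
    """Solve E1 . G = -dz1 and E2 . G = -dz2 for G = (p, q) by Cramer's rule."""
    x1, y1, z1 = v1
    x2, y2, z2 = v2
    det = x1 * y2 - x2 * y1
    if det == 0:
        raise ValueError("the two image edges are parallel; G is not determined")
    p = (-z1 * y2 + z2 * y1) / det
    q = (-x1 * z2 + x2 * z1) / det
    return (p, q)


# Tangent vectors (1, 0, -p) and (0, 1, -q) of a surface with gradient (1, 2).
G = surface_gradient_from_vectors((1.0, 0.0, -1.0), (0.0, 1.0, -2.0))
```

The call recovers `(1.0, 2.0)`, the gradient of the surface that contains both tangent vectors.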


The line L is Huffman's dual line (for a line in the image whose gradient is G1) [5].

This  result  has  two  corollaries: 

1.  If  two  planes  are  orthogonal,  their  gradients  obey  the 
above  relationship,  since  their  surface  normals  are 
perpendicular. 

2.  If  a  plane  contains  a  vector,  the  gradients  of  the  plane 
and  vector  obey  the  above  relationship,  since  the  vector 
is  perpendicular  to  the  surface  normal. 

8. Rotation of the Image and Gradient Space

Suppose a surface is defined by -z = f(x, y) and has gradient (p, q).
If we rotate the x-y axes around the z-axis through some angle θ to new (x', y') axes (figure 7), then:


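As a grows larger, the image point Pa converges to some point V in the image (figure 8):

\[ V = \lim_{a \to \infty} P_a = \left( -\frac{\Delta x}{\Delta z}, -\frac{\Delta y}{\Delta z} \right) \]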


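\[ p' = \frac{\partial f}{\partial x'} = p \cos\theta + q \sin\theta \]
\[ q' = \frac{\partial f}{\partial y'} = -p \sin\theta + q \cos\theta \]

This amounts to rotating the p-q axes by θ to determine the p'-q' axes. Thus, rotation of the image corresponds to identical rotation of the gradient space.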
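
The rotation can be sketched as follows. The q' row is the one given in the text; the p' row used here is its standard companion for a rotation of axes by θ and should be read as an assumption, as should the function name.

```python
import math


def rotate_gradient(G, theta):
    """Gradient (p', q') seen from x-y axes rotated by theta about the z-axis."""
    p, q = G
    c, s = math.cos(theta), math.sin(theta)
    return (p * c + q * s, -p * s + q * c)


G = (1.0, 0.0)
G_rot = rotate_gradient(G, math.pi / 2)
```

Rotating the axes by a quarter turn moves the gradient (1, 0) to approximately (0, -1) in the new coordinates, and the distance of the gradient from the origin is unchanged.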

Perspective 

In this section, perspective projection is assumed and its consequences in gradient space are described. While most of the results were presented by Kender [11], some new results are included in sections 12 and 13. This paper does not deal with scene transformations (such as rotation) and their effects in the image under perspective projection [14], nor with camera models [1, 2]. Instead, it describes the relationship between the scene, the image, and gradient space under perspective projection.

9.  Vanishing  Point  Of  a  Line 

Suppose a line in the scene is defined by (x, y, z) + a (Δx, Δy, Δz) for all values of a, where (x, y, z) is any point on the line and (Δx, Δy, Δz) is a direction vector of the line (i.e. any vector contained in the line). For any a, the corresponding point on the line is (x + aΔx, y + aΔy, z + aΔz) and its image point Pa is:


Corollaries  of  this  definition  are: 

1. If a line (or vector) has gradient G and vanishing point V, then G = -V.

2. Since V depends only on the direction vector (Δx, Δy, Δz), parallel lines have the same vanishing point. Thus, each point in the image is the vanishing point for a family of parallel lines.

3. If a line passes through the origin, then its vanishing point is the point at which it intersects the image plane (z = -1).

4. The vanishing point V of a vector must lie on the image line L containing the image of the vector (figure 8). The vector's gradient G must therefore lie on the line -L in gradient space, where -L is the line:

•  parallel  to  L 

•  at  an  equal  distance  from  the  origin  as  L 


\[ P_a = \left( \frac{-x - a \Delta x}{z + a \Delta z}, \frac{-y - a \Delta y}{z + a \Delta z} \right) \]


•  on  the  opposite  side  of  the  origin  from  L. 

Two  additional  observations  concerning  G  are: 

a. The vanishing point of a vector cannot be in the middle of its image, so if E is the image of the vector, then V cannot be within E and G cannot be within -E (figure 8).

b.  If  the  vector  in  the  scene  is  parallel  to  the  image 
plane,  it  has  no  vanishing  point  (i.e.  V  is  infinitely  far 
away  on  L);  it  has  no  gradient  (i.e.  G  is  infinitely  far 
from  the  origin,  in  the  direction  parallel  to  L). 
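
The convergence of image points to the vanishing point can be illustrated numerically. This is a sketch under the definitions of section 9; the function names and the sample line are assumptions for the example.

```python
def image_of_line_point(point, direction, a):
    """Perspective image of the scene point `point + a * direction`."""
    x, y, z = point
    dx, dy, dz = direction
    depth = z + a * dz
    if depth == 0:
        raise ValueError("point lies in the plane z = 0 and has no image")
    return (-(x + a * dx) / depth, -(y + a * dy) / depth)


def vanishing_point(direction):
    """Limit of the image points as a grows: (-dx/dz, -dy/dz)."""
    dx, dy, dz = direction
    if dz == 0:
        raise ValueError("line is parallel to the image plane; no vanishing point")
    return (-dx / dz, -dy / dz)


point, direction = (1.0, 2.0, -5.0), (1.0, 1.0, -1.0)
V = vanishing_point(direction)
far_image = image_of_line_point(point, direction, 1e6)
gradient = (direction[0] / direction[2], direction[1] / direction[2])
```

For this line `V` is `(1.0, 1.0)`, `far_image` lies within about 1e-5 of `V`, and `gradient` is `(-1.0, -1.0)`, that is, G = -V as stated in corollary 1.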

10.  Vanishing  Line  of  a  Surface 

Suppose a surface S has gradient Gs. For any vector L on S with gradient GL, Gs · GL = -1, as shown in section 7. Since the vanishing point of L is VL = -GL (by corollary 1 of section 9),

\[ G_S \cdot V_L = 1 \]


Suppose  that  Gs  is  given.  Then  the  above  equation  defines 
a  line  Vs  in  the  image,  containing  the  vanishing  points  VL  for 
all  vectors  L  contained  on  S  (figure  9).  In  fact, 


3. Suppose L is a line in the image. There exists a family of parallel surfaces for which L is the vanishing line. These surfaces all have the same gradient, which might be called the vanishing gradient for L, denoted GVL. Let L be the set of points (x, y) defined by the equation 1 = ax + by = (a, b) · (x, y). Then by section 10, since (x, y) is a point on L, (a, b) must be the gradient of the surfaces for which L is the vanishing line, i.e. GVL = (a, b). Thus, for any line L in the image, we can determine the associated vanishing gradient GVL, the gradient of the surfaces for which L is the vanishing line.

4. Suppose edge E is the image of some vector V with gradient Gv. If S is the surface through the origin and E, then E is the vanishing line of S (by corollary 2 above) and the gradient of S is GVE (by corollary 3 above). V must be contained on the surface S, so by corollary 2 of section 7,

\[ G_V \cdot G_{VE} = -1 \]

This is the relationship between a vector and the vanishing gradient of its image.

11.  Point  Contained  on  a  Surface 

Suppose a surface S has gradient G = (p, q) and intersects the z-axis at z = D (i.e. the plane is defined by px + qy + z - D = 0). Let P = (x, y) be a point in the image; it must correspond to some point X on surface S in the scene (figure 10).

Since the image of X is P, X = (ax, ay, -a) for some value of a. Since X also lies on S,

\[ p(ax) + q(ay) + (-a) - D = 0 \]

Solving this equation for a yields a = D / (px + qy - 1) = D / (P · G - 1) and

\[ X = \frac{D}{P \cdot G - 1} (x, y, -1) \]

X is sometimes called the back-projection of image point P onto surface S [11].


Figure 9: Vanishing Line and Gradient of a Surface

• Vs is perpendicular to the line from Gs to the origin

• the distance from Vs to the origin is the reciprocal of the distance from Gs to the origin

• Vs is on the same side of the origin as Gs.

The  line  Vs  is  called  the  vanishing  line  of  the  surface;  it  is 
the  locus  of  vanishing  points  for  all  vectors  on  the  surface. 

Corollaries  of  this  definition  are: 

1. Since Vs depends only on Gs, parallel surfaces have the same vanishing line. Thus, each line in the image is the vanishing line for a family of parallel surfaces.

2. If a surface passes through the origin, its vanishing line is the line along which it intersects the image plane (z = -1).


Figure 10: Back-Projection of a Point Onto a Surface


The quantity D / (P · G - 1) is the distance from X to the x-y plane (z = 0). If this value is negative, then X is not in the scene (i.e. it is behind the viewer), and P does not correspond to any point on the image of S.

The assumption has been made here that P · G - 1 ≠ 0, i.e. that P does not lie on the vanishing line of S.
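
Back-projection follows directly from the formula for a. The sketch below assumes the plane form px + qy + z - D = 0 used in section 11; the function name `back_project` is illustrative.

```python
def back_project(P, G, D):
    """Back-project image point P onto the plane p*x + q*y + z - D = 0."""
    x, y = P
    p, q = G
    denom = p * x + q * y - 1.0          # P . G - 1
    if denom == 0:
        raise ValueError("P lies on the vanishing line of the surface")
    a = D / denom                        # negative a means X is behind the viewer
    return (a * x, a * y, -a)


# The plane z = -4 has gradient (0, 0) and D = -4.
X = back_project((0.5, 0.25), (0.0, 0.0), -4.0)
reprojected = (-X[0] / X[2], -X[1] / X[2])
```

Here `X` is `(2.0, 1.0, -4.0)`, which lies on the plane, and projecting it again under perspective gives back the image point `(0.5, 0.25)`.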

12. Vector Contained on a Surface

Suppose a vector V has gradient Gv and lies on a surface S with gradient Gs. Then by corollary 2 of section 7:

\[ G_S \cdot G_V = -1 \]

Now suppose that V is visible in the image as some edge E (figure 11). By corollary 4 of section 10, if GVE is the vanishing gradient of E,

\[ G_{VE} \cdot G_V = -1 \]

These equations can be combined into a single matrix equation, in which the upper row is the first equation, and the lower row is the second equation:

\[ \begin{pmatrix} G_S \\ G_{VE} \end{pmatrix} G_V = \begin{pmatrix} -1 \\ -1 \end{pmatrix} \]


Since GVE is determined by the edge E in the image, this equation relates the gradient Gv of a vector with the image E of the vector and the gradient Gs of a surface containing the vector.


Figure  11:  Vector  Contained  on  a  Surface  Under  Perspective 

Corollary  to  this  result  is: 

1. Combining the two equations in a different manner:

\[ 0 = (G_S \cdot G_V) - (G_{VE} \cdot G_V) \]
\[ 0 = (G_S - G_{VE}) \cdot G_V \]
\[ (G_S - G_{VE}) \perp G_V \]

So, the line L in gradient space containing Gs and GVE is perpendicular to the line through the origin and Gv (figure 11).


2. There is an interesting restriction on line L in corollary 1. As shown in corollary 1, L must pass through GVE. Its slope depends on the gradient Gv of V. Gv is described in corollary 4 of section 9: it must lie on the line -E, but not within the line segment -E corresponding to the edge itself. This constrains the orientation of L such that the line through GVE perpendicular to L cannot pass through the line segment -E (figure 12). Hence, the position and length of an edge in the image constrain the gradients of surfaces containing the corresponding vector in the scene.



Figure  12:  Restrictions  on  the  Slope  of  L 


13.  Vector  Contained  on  Two  Surfaces 

Suppose a vector with gradient Gv is the boundary between two surfaces with gradients G1 and G2 (figure 13). If the vector appears as an edge E in the image, and GVE is the vanishing gradient of E, then by corollary 1 of section 12,

\[ (G_1 - G_{VE}) \perp G_V \]
\[ (G_2 - G_{VE}) \perp G_V \]

So, (G1 - GVE) || (G2 - GVE), i.e. GVE, G1, and G2 are collinear in gradient space.

Since GVE was determined by the location of the edge E in the image, the constraint provided on G1 and G2 is that they lie on a line which passes through GVE. This line is the same as L in corollaries 1 and 2 of section 12, and its orientation is limited to certain angles depending on the location and size of the image edge E.

This is the connect edge relation under perspective; it is the perspective counterpart to corollary 1 of section 5.
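
The collinearity of GVE, G1, and G2 can be checked on a small example. The sketch below assumes two planes of the form px + qy + z - D = 0 chosen for illustration, projects two points of their shared edge, and fits the image line 1 = ax + by to obtain GVE = (a, b). All function names are hypothetical.

```python
def perspective(point):
    """Perspective projection (x, y, z) -> (-x/z, -y/z)."""
    x, y, z = point
    return (-x / z, -y / z)


def vanishing_gradient(P1, P2):
    """Solve a*x + b*y = 1 through two image points; (a, b) is the vanishing gradient."""
    (x1, y1), (x2, y2) = P1, P2
    det = x1 * y2 - x2 * y1
    if det == 0:
        raise ValueError("image line passes through the origin; no finite (a, b)")
    return ((y2 - y1) / det, (x1 - x2) / det)


def collinearity_residual(A, B, C):
    """2-D cross product of (B - A) and (C - A); zero when A, B, C are collinear."""
    return (B[0] - A[0]) * (C[1] - A[1]) - (B[1] - A[1]) * (C[0] - A[0])


# Planes x + z + 4 = 0 and y + z + 2 = 0: gradients (1, 0) and (0, 1).
G1, G2 = (1.0, 0.0), (0.0, 1.0)
# Two scene points that lie on both planes, i.e. on their shared edge.
X1, X2 = (1.0, 3.0, -5.0), (6.0, 8.0, -10.0)
GVE = vanishing_gradient(perspective(X1), perspective(X2))
residual = collinearity_residual(GVE, G1, G2)
```

For this configuration `GVE` comes out close to (-1, 2) and `residual` is zero up to rounding, so the three gradients lie on one line.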


Figure  13:  Vector  Contained  on  Two  Surfaces  Under  Perspective 

Curved  Surfaces  and  Arcs 

In  these  sections,  the  gradient  is  defined  for  an  arc  in  the  scene. 
Then,  using  calculus,  the  fundamental  results  are  developed 
concerning  curved  arcs  and  surfaces.  The  results  are  very  similar 
to  those  of  sections  4,  5,  and  6;  this  is  because  surface  and  arc 
gradients  capture  the  same  (first-order  differential)  information 
contained in tangent lines and planes, which obey the results of sections 4, 5, and 6.

14.  Gradient  of  an  Arc 

Suppose an arc A is defined (in parametric form) by (x(s),


Figure  15:  Gradient  of  an  Arc  Under  Orthography 


4. Under orthography, the tangent vector to an arc is visible as the edge E = a (dx/ds, dy/ds) for some value of a (figure 15). If the gradient of the arc is G = (dx/dz, dy/dz), then G = (E/a) (ds/dz). In other words, the line in gradient space from the origin to G is parallel to the tangent vector E seen in the image.

15.  Arc  Contained  on  a  Surface 

Suppose an arc A = (xA(s), yA(s), zA(s)) is contained on a surface S defined by -z = f(x, y). Let the arc gradient be GA = (pA, qA) = (dxA/dzA, dyA/dzA), and let the surface gradient be Gs = (ps, qs) = (∂f/∂x, ∂f/∂y). Then for all s,

\[ -z_A(s) = f(x_A(s), y_A(s)) \]

We  can  differentiate  using  the  rule: 


y(s), z(s)). Then its gradient can be defined as:

\[ (p, q) = \left( \frac{dx}{dz}, \frac{dy}{dz} \right) = \left( \frac{dx}{ds} \frac{ds}{dz}, \frac{dy}{ds} \frac{ds}{dz} \right) \]

Note that both p and q are themselves functions of s.

Corollaries: 

1. At any point on an arc defined as above, the tangent

Figure 14: Gradient of an Arc and a Tangent Vector

2. If (Δx, Δy, Δz) is tangent to arc A at point X, and A is contained on a surface whose gradient at X is G, then -Δz = G · (Δx, Δy). Thus, the results of section 4 apply to the tangent vector to an arc.

16.  Arc  Contained  on  Two  Curved  Surfaces 

Suppose an arc A = (x(s), y(s), z(s)) with gradient GA is the boundary between two surfaces S1 and S2 with gradients G1 and G2 (figure 17). At any point on A, we have (by corollary 1 of section 15):

\[ -1 = G_A \cdot G_1 = G_A \cdot G_2 \]
\[ 0 = G_A \cdot (G_1 - G_2) \]
\[ G_A \perp (G_1 - G_2) \]

So the line containing G1 and G2 in gradient space is perpendicular to the line from the origin to GA.


Figure 17: Arc Contained on Two Curved Surfaces
Corollary to this is:

1. Under orthography, we can combine the above result with corollary 3 of section 14 to conclude that the line containing G1 and G2 in gradient space is perpendicular to the tangent to the arc in the image. This is the counterpart to the connect edge relationship for curved surfaces and arcs.

17.  Two  Arcs  Contained  on  a  Curved  Surface 

Suppose arcs A1 and A2 with gradients G1 and G2 are contained on a surface with gradient Gs (figure 18).

At a point of intersection of A1 and A2, we have (by corollary 1 to section 15):

\[ -1 = G_1 \cdot G_S \]
\[ -1 = G_2 \cdot G_S \]

These can be combined into a single matrix equation to yield:

\[ \begin{pmatrix} G_1 \\ G_2 \end{pmatrix} G_S = \begin{pmatrix} -1 \\ -1 \end{pmatrix} \]


This  allows  us  to  compute  the  gradient  of  a  surface  from  the 
gradients  of  two  intersecting  arcs  on  the  surface. 

Corollary  to  this  is: 

1. Under orthography, suppose that edges E1 and E2 in the image are tangent to the images of arcs A1 and A2 at a point where they intersect, and that E1 and E2 correspond to scene vectors (Δx1, Δy1, Δz1) and (Δx2, Δy2, Δz2). If A1 and A2 are contained on a surface with gradient Gs, then:

\[ \begin{pmatrix} \Delta x_1 & \Delta y_1 \\ \Delta x_2 & \Delta y_2 \end{pmatrix} G_S = \begin{pmatrix} -\Delta z_1 \\ -\Delta z_2 \end{pmatrix} \]

Thus, under orthography, the gradient of a surface can be computed from the Δz values for two vectors tangent to arcs on the surface, at a point where the arcs intersect.

Summary 

In this paper, we have defined the gradient for surfaces, planes, arcs, and vectors, and we have seen that the gradient and Δz attribute of an object are mutually constrained. We have also seen that knowledge about the gradient (or Δz-component) of a surface can be used to determine the gradient of a vector or arc on that surface, and that knowledge about the gradients of two such vectors or arcs can be used to uniquely determine the surface gradient. In addition, features in the image can be combined with gradient or Δz information to yield three-dimensional reconstructions of scene objects, under both perspective and orthographic projections.

This collection of theorems and definitions includes a recounting of results from Huffman [5], Mackworth [12], and Kender [11] as well as some new notation and results. The definition and use of arc and vector gradients as well as surface gradients has provided a more concise notation for several of these results.

ABSTRACT

In  this  paper  we  address  the  most  general 
version  of  the  problem  of  achieving  machine 
perception  and  tracing  of  "line-like”  structures 
appearing  in  an  image;  our  goal  is  human  level 
performance.  We  show  that  the  problem  can 
profitably  be  viewed  as  the  process  of  finding 
skeletons  in  a  gray  scale  image  after  observing 
(1)  that  line  detection  does  not  necessarily  depend 
on  gradient  information,  but  rather  is  approachable 
from  the  standpoint  of  measuring  total  intensity 
variation,  and  (2)  that  smoothing  the  original 
image  produces  an  approximate  distance  transform. 
We  present  an  effective  technique  for  extracting 
the  delineating  skeletons  from  an  image,  and  show 
examples  of  this  approach  employing  aerial, 
industrial,  and  radiographic  Imagery. 


I  INTRODUCTION 

For many tasks in scene analysis, there may not exist general solutions independent of purpose or intended application. However, for the task of linear delineation, one can easily find image subsets for which a panel of human observers would be almost unanimous in their interpretation without having to agree on the explicit criteria underlying their decisions; our goal is to produce a computer system that can perform the delineation task at close to human levels for at least these more obvious cases (especially where semantic knowledge is not required). In this paper we present some new ways of looking at the problem of linear delineation and provide techniques that are significantly more general and effective than previously reported methods for this task.


II  PROBLEM  DEFINITION 

For the purposes of this paper, we define linear delineation (LD) as the task of generating a set of lists (of coordinates) of points, for a given 2-D image, such that the points in each list fall sequentially along what any reasonable human observer would describe as a clearly visible "line-like" structure in the image. Practical examples of this task might be to delineate the roads,


rivers,  and  rail  lines  in  an  aerial  photograph,  or 
to  trace  the  paths  taken  by  blood  vessels  in  a 
radiographic  angiogram,  or  to  locate  the  wiring 
paths  on  a  printed  circuit  board;  however,  our  goal 
in  this  paper  is  not  to  look  for  specific  real- 
world  objects  or  to  assign  semantic  labels  to  the 
detected  linear  structures,  but  rather  to  find  the 
most  perceptually  obvious  (to  a  human  observer) 
occurrences  of  such  structures.  We  further 
distinguish  between  the  problems  of  (1)  detecting 
the  edges  or  contours  of  extended  objects,  and 
(2)  delineating  those  objects  whose  appearance  is 
adequately  represented  by  a  central  skeleton  —  we 
only  address  the  second  problem. 


III LINES AND EDGES

While  most  approaches  to  LD  do  not  distinguish 
between  lines  and  edges,  and  even  use  edge 
detection  as  a  necessary  first  step  in  the 
delineation  task,  a  critical  concept  advanced  in 
this  paper  is  the  distinction  between  line  and  edge 
detection. 

Edge  detection  is  intuitively  based  on  the 
concept  of  a  discontinuity  in  intensity  (or  other 
locally  measurable  attribute  such  as  color  or 
texture)  between  two  adjacent  but  distinct  regions 
in  an  image.  However,  in  a  digital  representation 
of  an  image,  we  can  always  fit  a  smooth  surface  to 
the sample values of the integer raster. Thus, edge detection must be based on parameters or thresholds set by assumptions about the nature of the image. (Even if we only mark as edge points
those  locations  at  which  there  are  first  derivative 
maxima,  or  zeros  of  the  second  derivative,  we  must 
still  ultimately  make  an  arbitrary  decision  in 
deciding  when  the  corresponding  gradient  is  large 
enough  to  be  called  a  discontinuity.) 

Intuitively,  a  "line-like”  or  linear  structure 
is  a  (connected)  region  that  is  very  long  relative 
to  its  width,  and  has  a  ridge  or  skeleton  along 
which  the  intensities  change  slowly  and  are 
distinguished  from  those  outside  the  region;  the 
width  need  not  be  constant,  but  any  changes  in 
width  should  occur  in  a  smooth  manner.  (To 
simplify  our  discussion,  we  will  assume  the  linear 
structures  are  distinguished  by  their  ridge  points 
being  brighter  than  the  surrounding  background;  but 
any  other  specified  attribute,  which  is  locally 
detectable, would be an acceptable substitute.) It is important to recognize the fact that a clearly visible line in an image may have no locally detectable edges (and thus no locally measurable width), or possibly only one detectable edge, or even two edges which are significantly separated and nonparallel (e.g., as might occur in a local widening of a river). Finally, it will generally be the case that linear structures have no visible internal detail that is essential to their delineation.

As a point of interest, it might be noted that
the mechanisms for generating subjective edge and
line illusions are quite different; subjective
edges appear to require a 3-D interpretation, while
subjective lines appear to be produced by
adaptation phenomena.


SMOOTHING,  DISTANCE  TRANSFORMS, 
AND  THE  GRAY  SCALE  SKELETON 


If we can find the edges of a linear structure
(LS), we can generate a distance transform and
extract a skeleton as the desired delineation
(e.g., Rosenfeld [1], Fischler [2]). However, as
we noted in the preceding section, LS do not always
have locally detectable edges, and, since all of
the generally known techniques for deriving a
skeleton require a complete contour, some other
approach is required. (The classical skeletonizing
techniques intimately link the contour/edges of a
region and its skeleton, and it is just this
linkage that we wish to break.)

Surprisingly, something equivalent to a
distance transform that works on gray scale images
(and on binary images as well) is already
available. To achieve our purpose, we need only
observe that the intensities in a properly smoothed
image can be considered to be the values of an
approximate distance transform. What is the best
smoothing function for general use? Actually, it
doesn't seem to make much difference in many cases.
Most digital images have been processed by low-pass
optical and electronic systems that have inserted
the required minimum level of smoothing. The
viewpoint that the smoothed image can be considered
to be a distance transform is the essential
element. If we start with a binary image, or a
very noisy image, then additional smoothing is
desirable and possibly even necessary. Since we
are not concerned with blurring edges, and we would
like to eliminate (blur) any structure or texture
internal to the linear regions, we want the
smoothing function to have a width of at least 1/2
that of the region to be delineated. If we
increase the width of the smoothing function, we
eventually eliminate the thinner linear structures,
and thus, if we wish to find all possible LS
without prior knowledge of the content of the
image, the processing should be repeated with a set
of filters having a spectrum of widths. Actually,
no more than 1 or 2 filtering steps should ever be
required. For example, to trace all the linear
structures (diameters up to 20 pixels) in a noisy
radiographic angiogram, a single filter of width 10
was used (see Figure 1). In a 256X256 image, we
could probably trace all the linear structures with
a diameter of up to approximately 1/4 of the image
width by employing just two filters (widths 10 and
30). To trace the roads in an aerial image, the
filtering produced by reducing the image from
1024X1024 to 256X256 was sufficient to achieve
excellent results (see Figure 4).
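
The smoothing step can be illustrated with a short sketch. The paper does not
name a particular smoothing function (and notes that the choice often matters
little), so the separable box filter below is an assumption, as are the
function names `box_smooth` and `smooth_at_widths`.

```python
import numpy as np

def box_smooth(image, width):
    """Smooth a 2-D image with a separable box (moving-average) filter.

    The box filter is an illustrative choice; the text only requires a
    smoothing function about half as wide as the structure of interest.
    """
    img = np.asarray(image, dtype=float)
    kernel = np.ones(width) / width
    before = width // 2
    after = width - 1 - before          # keeps the output the same size
    padded = np.pad(img, ((before, after), (before, after)), mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def smooth_at_widths(image, widths=(10, 30)):
    """Repeat the smoothing with a small set of filter widths."""
    return [box_smooth(image, w) for w in widths]
```

Applied to a binary image of a bright stripe, the smoothed intensities rise
toward the center of the stripe, which is the sense in which the smoothed
image behaves like an approximate distance transform.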


RIDGES  (OR  VALLEYS),  OPERATORS, 
AND  NEIGHBORHOODS 


Having produced an approximate distance
transform via smoothing, we now must deal with the
problem of locating the ridge points that denote
the spines (skeletons) of the LS. When an exact
distance transform is derived from a complete
contour, noise is not a problem and the skeleton
has assured geometric properties that make it easy
to detect; finding the ridge points of an
approximate distance transform is considerably more
complex.

We traditionally distinguish between locally
and globally detectable features: local features
are detectable by an intensity pattern which can be
observed through a small peep-hole centered on the
feature, while global features are ambiguous in a
small area. The model or description of the local
feature is generally compiled into an intensity
patch (matched filter or operator) which can be
convolved with the image to detect the
corresponding feature. In the case of an exact
distance transform, a 3X3 pixel operator is
sufficient to detect ridge points (a 2X2 operator
is sufficient for the Labeled Distance Transform
(Fischler [2])); for the approximate distance
transform, a small fixed-size operator is
ineffective.

The principal utility of a local operator is
that the number of data patterns the operator might
encounter is small enough to allow one to enumerate
a decision for each such pattern. If we further
agree to use a small square window of the image as
our local domain, and to use either table look-up
or convolution as the basis for decision making,
then a uniform mechanization can be employed to
implement a large number of distinct (generally
unrelated) local operators. The attractiveness of
this second (implementation) aspect has led to the
situation that almost all low-level (local) scene
analysis is done using such peep-hole type
operators. The disadvantage of this approach is
that the concept of local is relative to the size
of the entity of interest, and either one must know
this size in advance, or use a whole family of
operators of increasing size, where the larger
operators lose the advantages that led to their
use in the first place. In the case of line
detection, where the line width can vary over a
wide range of values, the conventional operator
concept is inappropriate.

Based on these general issues (even more so
than on the immediate problem at hand), we have
considered other realizations of general 'local'
decision-making processes that satisfy the
previously stated conditions, but do not
necessarily lend themselves to a convolution type
mechanization. In particular, restricting our
attention to finding the maxima and minima of
functions of the displacement along a space curve
defined over the image satisfies our requirements
for computational and decision-making simplicity
even when the curve traverses the entire image.
While the space curve might assume any shape (e.g.,
follow the contour of an object), the analysis
itself is independent of the shape; for the linear
delineation problem, we used image intensity as a
function of displacement along horizontal and
vertical scan lines. Since maxima and minima are
symmetrical attributes, we will only discuss the
problem of labeling maximal points along the curve.

The problem of finding the ridge points of an
approximate distance transform can be viewed as the
problem of finding the ridge points (local maxima)
of an exact distance transform to which some amount
of noise has been added. We are not concerned
about the possibility of making isolated
(incoherent) incorrect decisions, since, in the
next section, we will describe powerful linking and
pruning methods capable of eliminating such errors
and even eliminating the valid as well as invalid
weak coherent structures that are detected. Our
main problem is that, to identify valid ridge
points, we can't count on either finding large
local gradients or using a known line width to
determine some minimum significant gradient
threshold; additionally, noise will introduce many
false local maxima. Thus, we must use total
intensity change, rather than rate of change, to
detect valid ridge points; and we must have an
effective way of determining such total change even
in the presence of local variation introduced by
noise. (While it is not immediately obvious that
total intensity change, rather than rate of change,
will recover the perceptually obvious linear
features, our experiments indicate that this is
indeed the case.)

Our approach is to detect two distinct types
of intensity maxima along the space curves (in this
case, horizontal and vertical scan lines), to which
we assign the designations 'local' and 'global'
maxima. The local maxima measure of a point is the
total intensity difference from the point to the
highest of its immediate left and right minima.
The left (right) global maxima measure of a point
is the total intensity difference from the point to
the lowest value found moving to the left (right)
prior to encountering a point with an intensity
value equal to, or greater than, that of the given
point; the global maxima measure of the point is
the smaller of its left and right global measures.
If a point has an immediate neighbor with a higher
intensity value, it is not a maximal point and it
is not assigned either a local or global value
(actually, for implementation purposes, it is
assigned a zero value); on the other hand, every
maximal point will have both a local and global
value where the global value equals or exceeds the
local value.
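
The two measures defined above can be sketched for a single scan line as
follows. This is only an illustration of the definitions, not the authors'
implementation; the function name `maxima_measures` is hypothetical, and the
treatment of plateaus and of points at the ends of the scan line (a side with
no lower neighbor contributes a drop of zero) is an assumption that the text
does not address.

```python
def maxima_measures(profile):
    """Return (local, global) maxima measures for a 1-D intensity profile.

    Non-maximal points receive zero for both measures, as in the text.
    """
    n = len(profile)
    local = [0] * n
    glob = [0] * n
    for i, v in enumerate(profile):
        # A point with a higher immediate neighbor is not a maximal point.
        if (i > 0 and profile[i - 1] > v) or (i < n - 1 and profile[i + 1] > v):
            continue
        local_drops, global_drops = [], []
        for step in (-1, 1):
            # Immediate minimum: walk downhill until the profile rises again.
            j = i
            if 0 <= i + step < n and profile[i + step] < v:
                j = i + step
                while 0 <= j + step < n and profile[j + step] <= profile[j]:
                    j += step
            local_drops.append(v - profile[j])
            # Global: lowest value met before a point with intensity >= v.
            lowest = v
            j = i + step
            while 0 <= j < n and profile[j] < v:
                lowest = min(lowest, profile[j])
                j += step
            global_drops.append(v - lowest)
        # "Highest of the two minima" is the smaller of the two drops.
        local[i] = min(local_drops)
        glob[i] = min(global_drops)
    return local, glob
```

For the profile `[0, 5, 3, 4, 1, 9, 2, 6, 0]` the peak of height 9 dominates
the whole line, so its global measure (9) exceeds its local measure (7), while
the small bump of height 4 receives a measure of 1 in both cases.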


We have been proceeding under the assumption
that a large local intensity maximum (LIM) denotes a
significant event, but, in the presence of large
variations in image intensity or noise, the global
intensity maximum (GIM) would be a better detector
of significant intensity variation; however, in a
well-smoothed or relatively noise-free image, there
might be very little difference in the information
contained in the LIM and GIM measures. There is
also the issue of deciding what is a large-enough
value, of either the LIM or GIM, to indicate
significance. In the example shown in Figure 2,
only 1/3 of the points were maximal points, and, in
a smoothed image, this percentage should be much
smaller. Given the powerful linking and pruning
techniques we describe in the next section, it
might be possible to return all the maximal points
in a binary mask and still extract the desired line
structure from the background noise contained in
such a mask. However, it makes much more sense to
first eliminate those maximal points that do not
have enough intensity variation to be perceptually
distinguishable from a flat background. (It would
even appear that we could, without losing
essential information, eliminate those maximal
points with a total variation less than that
required to perceive them against a random-noise
field with the same statistical variation as the
measured variation over some surrounding
neighborhood in the image.)

We are still working on the question of what
constitutes an optimal threshold for maximal point
intensity variation (in order to extract a binary
mask to represent the major linear structures in
the image) and we expect that the ideas mentioned
above will soon lead to an acceptable solution.
Our optimism is based on the fact that we have been
able to use one fixed pair of threshold values (20
intensity units out of 256 for the LIM, and 40 for
the GIM) for all of our experiments and still
generally obtain very good results (for example,
see Figures 1-5); thus, we already have a heuristic
solution that avoids the need for manual
intervention, and there appears to be a scientific
basis for obtaining a more effective answer.


VI  CLUSTERING,  LINKING,  PRUNING,  RANKING, 
AND  FINAL  DELINEATION 


(This work will be described in detail in a
paper currently nearing completion; Fischler and
Wolf [3].)

Based on the availability of a high quality
binary overlay depicting the locations of the major
linear structures contained in the given gray scale
image, obtained as described in the preceding
section, we have been able to demonstrate that:

(1)  The linking step in the delineation
process can effectively be based on the
single attribute of geometric proximity,
and that a clustering or association
step, followed by the construction of a
Minimal Spanning Tree (MST) through the
points of each cluster (Fischler [4]),
will correctly link the ridge points
along the skeletons of the linear
structures (see Figures 4 and 5).

(2)  The desired delineations will be embedded
in trees containing additional branches
that are either minor linear structures
or noise, and that simple pruning
techniques can eliminate most of this
unwanted detail (see Figure 3; note that
tree pruning can effectively achieve
simplifications that would be difficult,
if not impossible, at lower levels of
organization of the information).

(3)  Having properly linked the ridge points
and pruned some of the smaller branches
of the resulting trees, we can extract
long coherent paths by a decision
procedure applied at each node of each
tree. This decision procedure, based on
the (local to the node) branch attributes
of intensity, connectivity, and
directionality, assigns path connectivity
through a node by splitting off
incompatible branches; any remaining
ambiguities (more than two branches
entering a node) are resolved by choosing
those pairings that result in the longest
paths. (See Figure 3)

(4)  The paths obtained in the tree
partitioning step can be rank ordered
with respect to perceptual quality by a
metric based on the path attributes of
total length, average intensity, local
intensity variation, and continuity
(i.e., ratio of total internal gap length
to path length). (See Figure 4)
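
As an illustration of the linking step in item (1), the sketch below builds a
minimal spanning tree over the ridge points of one cluster using only
geometric proximity (Euclidean distance). Prim's algorithm is used here for
brevity; the text does not say which MST construction was employed, and the
function name `minimal_spanning_tree` is hypothetical. The clustering,
pruning, and path-ranking steps are not shown.

```python
import math

def minimal_spanning_tree(points):
    """Return MST edges (as index pairs) over 2-D points, Prim's algorithm."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Update connection costs of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges
```

For ridge points sampled along a single line, the tree reduces to a chain
joining each point to its nearest neighbors along the line, which is the
behavior the linking step relies on.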


VII  CONCLUDING  COMMENTS 

We have presented the viewpoint that the
problem of delineating the obvious linear
structures in an image is distinct from that of
finding edges or contours, and is best viewed as
the process of finding skeletons in gray scale
images (i.e., that line detection does not
necessarily depend on gradient information, but
rather is approachable from the standpoint of
detecting total intensity variation); as a
necessary step in this process, we have suggested
that an approximate gray scale distance transform
can be attained by smoothing the original image.
We have described an effective technique for
finding ridge points (points on the delineating
skeleton), and, in the process, raised some
important questions about the conventional approach
to designing 'local operators.'

Working with both the binary overlay (produced
as discussed above) and the original gray scale
image, we have demonstrated via examples that the
remaining steps in the delineation process can be
effectively achieved.


Our goal in this work has been to approach
human levels of performance in finding perceptually
obvious delineations in images selected at random
from a reasonably broad class of scene domains, and
without any human intervention or prior knowledge
about the image content. We believe that this goal
can be achieved through extension and refinement of
the techniques described in this paper.


FIGURE 1 - IMAGE SMOOTHING, AND FINDING LINEAR FEATURE POINTS IN THE
INTENSITY VALLEYS OF A RADIOGRAPHIC IMAGE (ANGIOGRAM)
a)  Original gray scale image
b)  Intensity profiles of original image
c)  Smoothed gray scale image
d)  Intensity profiles of smoothed image
e)  Terrain map of intensity surface of smoothed image
f)  Linear feature points found in valleys of intensity surface


FIGURE 2 (caption not legible in source)
a)  Original gray scale image
b)  Location and magnitude of global intensity maxima
c)  Location and magnitude of local intensity maxima
d)  Global intensity maxima with magnitude greater than 20


FIGURE 3 - CLUSTERING, PRUNING, AND LINEAR PATH EXTRACTION
(earlier panels not legible in source; one surviving fragment reads
"trimming steps")
e)  Maximum length segment extracted from trimmed cluster
f)  Overlay of maximum length segment extracted from trimmed cluster


FIGURE 4 - CLUSTERING, LINKING, AND LINEAR PATH EXTRACTION IN AN AERIAL IMAGE
a)  Original gray scale picture
b)  Overlay of detected linear feature points
c)  Four best segments found in linear feature points
d)  Overlay of four best segments found in linear feature points
e)  All segments found in linear feature points
f)  Overlay of all segments found in linear feature points


FIGURE 5 - CLUSTERING, LINKING, AND LINEAR PATH EXTRACTION IN AN
INDUSTRIAL IMAGE (PRINTED CIRCUIT BOARD)
a)  Original gray scale picture (data area enclosed in box)
b)  Overlay of detected linear feature points
c)  All segments found in linear feature points
d)  Overlay of all segments found in linear feature points


ABSTRACT

Most processes that use the disparities
resulting from stereo image matching make the
unrealistic assumption that every match obtained
was correct. Unfortunately, match errors can occur
for a variety of image- and terrain-related
reasons. Such errors produce undesirable effects
when the disparity data is further processed (for
instance, to form a scene model), and should be
detected and corrected as early in the processing
as possible.

Algorithms have been developed to detect
errors in disparities derived from aerial
photographs of terrain. These algorithms constrain
the allowable magnitude of the second derivatives
of disparities in local areas around each point.
Relaxation-like techniques are employed in the
iteration of the detection process.


INTRODUCTION 


Most stereo matching processes assume that
every match obtained will be a correct match.
Unfortunately, this assumption is unrealistic.
Many conditions can cause false matches, including
low image information, strong linear edges, image
noise, repetitive visual textures, distortions due
to changes in the point of view, and obscurations
due to scene discontinuities or moving objects.

Some of these conditions, such as low
information, strong linear edges, and repetitive
visual textures, can be detected in a single image.
Matching of these areas can then be deferred until
more global information (such as a set of reliable
adjacent matches) is available to guide the
matching process. Potential mismatches due to
distortion, obscuration, or motion cannot be
detected from a single image. Instead, the
matching process or its results must be used to
determine that a mismatch has occurred. Steps can
then be taken to cope with the problem.

Traditional methods for detecting mismatches
are based on the concept that correlation
coefficient values will be low if an area is paired
with anything other than its proper match. This
condition has been detected by thresholding the
correlation value: sometimes with an arbitrary
threshold (0.5 is popular), sometimes with a
threshold based on the expected correlation of
random noise for a given window size, and sometimes
with a threshold based on the information content
of the window being matched [Hannah, 1974]. These
methods can give false indications of mismatches in
the presence of noise, uncorrected distortion, or
scene discontinuities, and they can falsely verify
an inappropriate match due to a moving object.

What is needed is a technique for evaluating a
set of matches to determine which, if any, of them
are inconsistent with the rest. Global techniques,
such as fitting polynomials to the disparity data
and evaluating each point on how well the
polynomials fit it, have serious drawbacks. Such
techniques apply the same criteria to all areas of
a set of matches, despite the fact that terrain
roughness, hence disparity differences, can vary
greatly from one area to another. These variations
can, for instance, cause a moderate error in a flat
area to be missed while a large change in disparity
at a mountain peak in a rough area is mistakenly
marked as an error.


DISPARITY  CONSTRAINTS 

A more promising approach is the use of
relaxation-like techniques for detecting local
errors in the disparities by the application of
constraints on the changes in the disparities.
(Relaxation is an iterative technique for
establishing the likelihood of a data point,
depending on how well that point is supported by
neighboring data points [Rosenfeld et al., 1976].)
A similar approach has been developed for the
detection of terrain elevation inconsistencies and
has been used successfully to evaluate and rectify
full grids of digital elevation data [Hannah,
1981]. To be useful in the application of
assessing the reliability of a set of matches, the
technique has been extended from grids of data
points to arbitrarily spaced points, and from
elevation data to two-dimensional disparity
vectors. (The disparity approach, like its parent
terrain elevation evaluation algorithm, assumes
that the scene being evaluated is continuous and
reasonably smooth, and that there are sufficient
disparities to adequately sample the variations in
the surface.)

In the grid-based elevation approach, each
point had obvious neighbors: the horizontally,
vertically, and diagonally adjacent points on the
grid. In the general case, the concept of
neighborhood implicit in the grid does not exist,
so it must be explicitly created as a pre-processing
step. The matches to be evaluated are first
paired, i.e., each point to be evaluated is
associated with the N points which lie closest to
it in the image. This process is constrained by
first dividing the image into a grid of "buckets";
neighboring points can only come from the current
or one of the eight adjacent buckets. Triples are
formed at each point by combining that point with
any two of its neighbors such that the angle formed
by the three points is within a specified angle
(usually 45 degrees) of 180 degrees, i.e., the
three points are nearly co-linear. Only the M
straightest triples are kept for each point. Note
that a point may participate in up to M triples as
the center point, plus up to N triples as one of
the endpoints.
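
The pairing and triple-forming steps can be sketched as follows. This is a
simplified illustration rather than the original implementation: the bucket
grid is replaced by a brute-force nearest-neighbor search, and the function
names and the default values of N and M are assumptions. Only the 45-degree
tolerance comes from the text.

```python
import math
from itertools import combinations

def nearest_neighbors(points, i, n_neighbors):
    """Indices of the n_neighbors points closest to point i (brute force)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:n_neighbors]

def bend(points, center, a, b):
    """Degrees by which the angle a-center-b falls short of 180."""
    ax, ay = points[a][0] - points[center][0], points[a][1] - points[center][1]
    bx, by = points[b][0] - points[center][0], points[b][1] - points[center][1]
    angle = abs(math.degrees(math.atan2(ay, ax) - math.atan2(by, bx)))
    angle = min(angle, 360.0 - angle)   # interior angle in [0, 180]
    return 180.0 - angle

def straightest_triples(points, n_neighbors=4, m_triples=2, max_bend=45.0):
    """For each center point, keep the M straightest (end1, center, end2)."""
    triples = {}
    for c in range(len(points)):
        candidates = []
        for a, b in combinations(nearest_neighbors(points, c, n_neighbors), 2):
            deviation = bend(points, c, a, b)
            if deviation <= max_bend:
                candidates.append((deviation, a, b))
        candidates.sort()
        triples[c] = [(a, c, b) for _, a, b in candidates[:m_triples]]
    return triples
```

A bucket grid, as described in the text, would limit the neighbor search to
the current and eight adjacent buckets and so avoid the quadratic cost of the
brute-force search used here.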

Each triple k has two values associated with
it. One is an estimation of the reasonableness of
that sequence of three disparities, which is based
on the vector second derivative of the disparities

\[
\mathrm{DERIV2}(k) =
\frac{\mathrm{DISP}(\mathrm{cent}(k)) - \mathrm{DISP}(\mathrm{end1}(k))}
     {\mathrm{DIST}(\mathrm{cent}(k), \mathrm{end1}(k))}
-
\frac{\mathrm{DISP}(\mathrm{end2}(k)) - \mathrm{DISP}(\mathrm{cent}(k))}
     {\mathrm{DIST}(\mathrm{cent}(k), \mathrm{end2}(k))}
\]

where DISP is the disparity vector at a point of
the triple, DIST is the distance between the center
and an endpoint of the triple, and 'cent', 'end1',
and 'end2' are the center and two endpoints of the
triple, respectively. The components of this
vector are compared to thresholds established by
the user to form EST(k), the estimate of
reasonableness for this triple. EST has values in
the range of -1.0 to 1.0 (0.0 indicates a
reasonable center disparity, 1.0 indicates a
disparity that is too large, and -1.0 indicates a
disparity that is too small). The second value
associated with the triple is WT(k), a weight to be
used with the reasonableness value; this number is
based on the amount by which the three points of
the triple deviate from a straight line, so that
straight triples have more effect on the result
than do bent triples.
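
A minimal sketch of the second-derivative computation for one triple is given
below. It assumes that the two distance-normalized disparity differences are
subtracted, which makes the result positive when the center disparity is
larger than its neighbors suggest; the function name `deriv2` and the data
layout (lists of 2-D tuples) are hypothetical. The mapping from DERIV2 to
EST and the form of WT are not specified in the text and are not shown.

```python
import math

def deriv2(disp, pos, cent, end1, end2):
    """Vector second derivative of disparity along the triple end1-cent-end2.

    disp[i] is the 2-D disparity vector at point i; pos[i] is its image
    position. Returns one value per disparity component.
    """
    d1 = math.dist(pos[cent], pos[end1])
    d2 = math.dist(pos[cent], pos[end2])
    return tuple(
        (disp[cent][c] - disp[end1][c]) / d1
        - (disp[end2][c] - disp[cent][c]) / d2
        for c in range(2)
    )
```

When the disparities change linearly along the triple, both components are
zero, which corresponds to a reasonable center disparity.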

The  reliability  R  of  each  point  is  now 
established  iteratively.  The  reliabilities  are 
initially  estimated  to  be  the  value  of  the 
correlation  coefficient  associated  with  each 
matched  point.  (We  assume  that  all  matches  with 
negative  correlations  have  been  rejected!)  This  is 
iteratively  refined  by 

\[
R(i,n) = 1.0 - \left|
\frac{\sum_k \mathrm{REL}(i,k,n-1)\,\mathrm{EST}(k)}
     {\sum_k \mathrm{REL}(i,k,n-1)}
\right|
\]

That is, the reliability R of the i-th point on the
n-th iteration is 1.0 minus the absolute value of
the weighted average of the reasonableness
estimates EST taken over all triples k involving
the point i. The weighting factor REL is the
product of the "straightness" weight WT associated
with the triple and the minimum of the
reliabilities R of the other two points in the
triple on the previous iteration, i.e.

\[
\mathrm{REL}(i,k,n-1) = \mathrm{WT}(k)\,
\min\bigl(R(j_1,n-1),\, R(j_2,n-1)\bigr),
\qquad j_1, j_2 \in k,\; j_1, j_2 \neq i
\]

Thus the reasonableness of a triple has an effect
in proportion to the minimal reliability of the
triple, preventing a bad match from prejudicing the
evaluation of points around it.

The reliabilities are iterated until they
appear to have converged, that is, until they
change less than a user-specified threshold from
one iteration to the next. In practice, it usually
requires three iterations for the reliabilities to
converge within a tolerance of 0.01.

RESULTS

Figure 1 shows the results of applying this
algorithm. The underlying image data has been
formed into a reduction hierarchy with reduction
factor N = 2. The points to be matched were chosen
by an "interest operator", which selects well-spaced
areas having reasonable information content.
Each point was then matched by a hierarchical
matching technique [Hannah, 1980]. The figure
shows the results of this matching by overlaying a
symbol at the positions of the matching points in
each image. In addition, each point in the left
image has emanating from it the disparity vector
associated with that match, in effect pointing to
the position occupied by the matching point in the
right image.

Examination of the disparity vectors reveals
some obvious mismatches: most of the disparity
vectors point strongly to the left and up slightly,
but two near the top have significant downward
components. The matches have been graded based on
their reliability; the numeral which marks each
match is the first digit of the reliability (i.e.,
matches marked with a 9 have a reliability greater
than 0.9). The algorithm has correctly identified
the two obvious mismatches near the top of the
image (graded 0 and 1), along with a less obvious
one in the lower left quarter of the image (graded
2). Two fairly subtle mismatches near the ridge
line both received reasonably high grades of 7; all
other matches are correct to within one pixel.
(The correctness of the matches was established by
comparison to results obtained by the U. S. Army
Engineer Topographic Laboratories, Ft. Belvoir, VA,
using an interactively "coached" correlation
matching process. The author is indebted to ETL
for this imagery and disparity data.)
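
The iterative reliability update defined under DISPARITY CONSTRAINTS can be
sketched as below. The function name, the argument layout, and the handling
of a point whose weights sum to zero (its previous reliability is kept) are
assumptions made for illustration; the update rule and the 0.01 tolerance
follow the text.

```python
def iterate_reliabilities(initial, triples, est, wt, tol=0.01, max_iter=50):
    """Refine point reliabilities until they change by less than tol.

    initial   -- starting reliabilities (e.g. correlation coefficients)
    triples   -- list of (end1, cent, end2) index tuples
    est, wt   -- reasonableness estimate and straightness weight per triple
    """
    r = list(initial)
    for _ in range(max_iter):
        updated = []
        for i in range(len(r)):
            num = den = 0.0
            for k, members in enumerate(triples):
                if i not in members:
                    continue
                # Weight by the least reliable of the other two points.
                rel = wt[k] * min(r[j] for j in members if j != i)
                num += rel * est[k]
                den += rel
            updated.append(1.0 - abs(num / den) if den > 0 else r[i])
        change = max((abs(a - b) for a, b in zip(updated, r)), default=0.0)
        r = updated
        if change < tol:
            break
    return r
```

Because each triple is weighted by the smaller reliability of its other two
points, a triple that contains a badly matched point contributes little to
the evaluation of its neighbors on later iterations.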


CONCLUSIONS 




This algorithm for the determination of the
reliability of a set of matches performs quite
creditably in the domain for which it was designed.
In this example, it correctly evaluated 49 out of
51 matches, missing only two subtle errors.

Figure 1. Reliabilities of Matched Image Points.

This paper describes the results obtained in
a research program ultimately concerned with
deriving a physical sketch of a scene from one or
more images. Our approach involves modeling
physically meaningful information that can be used
to constrain the interpretation process, as well
as modeling the actual scene content. In
particular, we address the problems of modeling
the imaging process (camera and illumination), the
scene geometry (edge classification and surface
reconstruction), and elements of scene content
(material composition and skyline delineation).

I  INTRODUCTION 

Images are inherently ambiguous
representations of the scenes they depict: images
are 2-D views of 3-D space, they are single slices
in time of ongoing physical and semantic
processes, and the light waves from which the
images are constructed convey limited information
about the surfaces from which these waves are
reflected. Therefore, interpretation cannot be
strictly based on information contained in the
image; it must involve, additionally, some
combination of a priori models, constraints, and
assumptions. In current machine-vision systems
this additional information is usually not made
explicit as part of the machine's data base, but
rather resides in the human operator who chooses
the particular techniques and parameter settings
to reflect his understanding of the scene context.
This paper describes a portion of the SRI program
in machine vision research that is concerned with
identifying and modeling physically meaningful
information that can be used to automatically
constrain the interpretation process. In
particular, as an adjunct to any autonomous system
with a generalized competence to analyze imaged
data of 3-D real-world scenes, we believe that it
is necessary to explicitly model and use the
following types of knowledge:


(1) Camera model and geometric constraints
    (location and orientation in space from
    which the image was acquired, vanishing
    points, ground plane, geometric horizon,
    geometric distortion).

(2) Photometric and illumination models
    (atmospheric and image-processing system
    intensity-transfer functions, location
    and spectrum of sources of illumination,
    shadows, highlights).

(3) Physical surface models (description of
    the 3-D geometry and physical
    characteristics of the visible surfaces;
    e.g., orientation, depth, reflectance,
    material composition).

(4) Edge classification (physical nature of
    detected edges; e.g., occlusion edge,
    shadow edge, surface intersection edge,
    material boundary edge, surface marking
    edge).

(5) Delineation of the visible horizon
    (skyline).

(6) Semantic context (e.g., urban or rural
    scene, presence of roads, buildings,
    forests, mountains, clouds, large water
    bodies, etc.).

In the remainder of this paper, we will
describe in greater detail the nature of the above
models, our research results concerning how the
parameters for some of these models can be
automatically derived from image data, and how the
models can be used to constrain the interpretation
process in such tasks as stereo compilation and
image matching.

If we categorize constraints according to the
scope of their influence, then the work we describe
is primarily concerned with global and extended
constraints rather than with constraints having
only a local influence. To the extent that
constraints can be categorized as geometric,
photometric, or semantic and scene dependent, it
would appear that we have made the most progress in
understanding and modeling the geometric
constraints.
Engineer Topographic Laboratory and by the U. S. Army Research Office.

II CAMERA MODELS AND GEOMETRIC CONSTRAINTS


The camera model describes the relationship
between the imaging device and the scene; e.g.,
where the camera is in the scene, where it is
looking, and more specifically, the precise mapping
from points in the scene to points in the image.
In attempting to match two views of the same scene
taken from different locations in space, the camera
model provides essential information needed to
contend with the projective differences between the
resulting images.

In the case of stereo reconstruction, where
depth (the distance from the camera to a point in
the scene) is determined by finding the
corresponding scene point in the two images and
using triangulation, the camera models (or more
precisely, the relative camera model) limit the
search for corresponding points to one dimension in
the image via the "epipolar" constraint. The plane
passing through a given scene point and the two
lens centers intersects the two image planes along
straight lines; thus a point in one image must lie
along the corresponding (epipolar) line in the
second image, and one need only search along this
line, rather than the whole image, to find a match.
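
The one-dimensional search can be illustrated with a minimal sketch. It
assumes the relative camera model is summarized by a 3x3 fundamental
matrix F, a later formulation that this paper does not use; the function
names and the example matrix (two identical cameras displaced along the
image x-axis) are hypothetical and chosen only for illustration.

```python
import math

def epipolar_line(F, point):
    """Return (a, b, c) with a*u + b*v + c = 0: the line in image 2
    on which the match of `point` (taken from image 1) must lie."""
    x, y = point
    p = (float(x), float(y), 1.0)  # homogeneous image coordinates
    return tuple(sum(F[r][k] * p[k] for k in range(3)) for r in range(3))

def distance_to_line(line, point):
    """Perpendicular distance (in pixels) from `point` to `line`."""
    a, b, c = line
    u, v = point
    return abs(a * u + b * v + c) / math.hypot(a, b)

# Hypothetical F for a sideways-displaced camera pair: each epipolar
# line is simply the image row through the original point.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

line = epipolar_line(F_RECTIFIED, (3, 5))
candidates = [(10, 5), (10, 8), (40, 5)]
# Keep only the candidates within half a pixel of the epipolar line.
on_line = [q for q in candidates if distance_to_line(line, q) < 0.5]
```

Only the candidates on row 5 survive the test, so the matcher never has
to examine the rest of the second image.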

When human interaction is permissible, the
camera model can be found by having the human
identify a number of corresponding points in the
two images and using a least-squares technique to
solve for the parameters of the model [5]. If
finding the corresponding points must be carried
out without human intervention, then the
differences in appearance of local features from
the two viewpoints will cause a significant
percentage of false matches to be made; under these
conditions, least squares is not a reliable method
for model fitting. Our approach to this problem
[3] is based on a philosophy directly opposite to
that of least squares: rather than using the full
collection of matches in an attempt to "average
out" errors in the model-fitting process, we
randomly select the smallest number of points
needed to solve for the camera model and then
enlarge this set with additional correspondences
that are compatible with the derived model. If the
size of the enlarged compatibility set is greater
than a bound determined by simple statistical
arguments, the resulting point set is passed to a
least-squares routine for a more precise solution.
We have been able to show that as few as three
correspondences are sufficient to directly solve
for the camera parameters when the three-space
relationships of the corresponding points are
known; a recent result [13] indicates that 5 to 8
points are necessary to solve for the relative
camera model parameters when three-space
information is not available a priori.
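
The select-then-enlarge procedure described above is widely known as
random sample consensus (RANSAC). The following is a minimal sketch of
the idea, not the authors' implementation: to stay short it fits a 2-D
line (minimal sample of two points) rather than a camera model, and the
tolerance, consensus bound, trial count, and function names are all
assumptions made for illustration.

```python
import random

def fit_line_least_squares(points):
    """Ordinary least-squares fit y = slope*x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def ransac_line(points, tolerance, min_consensus, max_trials=200, seed=0):
    """Random minimal samples; refit once a large consensus set is found."""
    rng = random.Random(seed)
    for _ in range(max_trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # smallest solvable set
        if x1 == x2:
            continue  # vertical line: not representable as y = m*x + c
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Enlarge the sample with every point compatible with the model.
        consensus = [(x, y) for x, y in points
                     if abs(y - (slope * x + intercept)) <= tolerance]
        if len(consensus) >= min_consensus:
            # Only now apply least squares, on the consensus set alone.
            return fit_line_least_squares(consensus)
    return None  # no acceptable model found

# Synthetic data: ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(2, 30), (5, -20), (7, 50)]
model = ransac_line(points, tolerance=0.5, min_consensus=8)
```

Because the gross outliers never enter the final fit, the recovered
slope and intercept match the underlying line, whereas a least-squares
fit over all thirteen points would be pulled away from it.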

The perspective imaging process (the formation
of images by lenses) introduces global constraints
that are independent of the explicit availability
of a camera model; particularly important are the
detection and use of "vanishing points." A set of
parallel lines in 3-D space, such as the vertical
edges of buildings in an urban scene, will project
onto the image plane as a set of straight lines
intersecting at a common point. Thus, for example,
if we can locate the vertical vanishing point, we
can strongly constrain the search for vertical
objects such as telephone or power poles or
building edges, and we can also verify conjectures
about the 3-D geometric configuration of objects
with straight edges by observing which vanishing
points these edges pass through. The two
horizontal vanishing points corresponding to the
rectangular layout of urban areas, the vanishing
point associated with a point of illumination [8],
and the vanishing point of shadow edges projected
onto a plane surface in the scene, provide
additional constraints with special semantic
significance. The detection of clusters of
straight parallel lines by finding their vanishing
points can also be used to automatically screen
large amounts of imagery for the presence of
man-made structures.

The technique we have employed to detect
potential vanishing points involves local edge
detection by finding zero-crossings in the image
convolved with both Gaussian and Laplacian
operators [9], fitting straight line segments to
the closed zero-crossing contours, and then finding
clusters of intersection points of these straight
lines. In order to avoid the combinatorial problem
of computing intersection points for all pairs of
lines, or the even more unreasonable approach of
plotting the infinite extension of all detected
line segments and noting those locations where they
cluster, we have implemented the following
technique. Consider a unit radius sphere
physically positioned in space somewhere over the
image plane (there are certain advantages to
locating the center of the sphere at the camera
focal point if this is known, in which case it
becomes the Gaussian sphere [6,7], but any location
is acceptable for the purpose under consideration
here). Each line segment in the image plane and
the center of the sphere define a plane that
intersects the sphere in a great circle. If two
or more straight lines intersect at the same point
on the image plane, their great circles will
intersect at two common points on the surface of
the sphere, and the line passing through the center
of the sphere and the two intersection points on
the surface of the sphere will also pass through
the intersection point in the image plane.
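
The geometry of the sphere construction can be sketched for a single
pair of line segments. This is only an illustration under stated
assumptions: the sphere center is placed at the origin, the image plane
at z = focal_length, and the function names are hypothetical. The
clustering of many great circles on the sphere, which is the point of
the technique, is not shown.

```python
import math

def cross(a, b):
    """Vector cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def great_circle_normal(segment, focal_length=1.0):
    """Normal of the plane through the sphere center (the origin) and an
    image segment lying in the plane z = focal_length."""
    (x1, y1), (x2, y2) = segment
    return cross((x1, y1, focal_length), (x2, y2, focal_length))

def vanishing_point(seg_a, seg_b, focal_length=1.0):
    """Return (direction, image_point). `direction` is one of the two
    antipodal points where the great circles meet; `image_point` is where
    the line through the center and that point meets the image plane
    (None if the image lines are parallel)."""
    d = cross(great_circle_normal(seg_a, focal_length),
              great_circle_normal(seg_b, focal_length))
    norm = math.sqrt(sum(c * c for c in d))
    if norm == 0.0:
        return None, None  # both segments lie on the same great circle
    direction = tuple(c / norm for c in d)
    if abs(direction[2]) < 1e-12:
        return direction, None  # intersection at infinity in the image
    scale = focal_length / direction[2]
    return direction, (direction[0] * scale, direction[1] * scale)

# Two image lines that meet at (4, 2): y = x/2 and x = 4.
direction, point = vanishing_point(((0, 0), (2, 1)), ((4, 0), (4, 1)))
# Two parallel image lines: the intersection exists only on the sphere.
far_direction, far_point = vanishing_point(((0, 0), (1, 0)), ((0, 1), (1, 1)))
```

The second call shows one practical benefit of working on the sphere:
lines that are parallel in the image still yield a finite, well-defined
intersection direction.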


III EDGE CLASSIFICATION

An intensity discontinuity in an image can
correspond to many different physical events in the
scene, some very significant for a particular
purpose, and some merely confusing artifacts. For
example, in matching two images taken under
different lighting conditions, we would not want to
use shadow edges as features; on the other hand,
shadow edges are very important cues in looking for
(say) thin raised objects. In stereo matching,
occlusion edges are boundaries that area
correlation patches should not cross (there will
also be a region on the "far" side of an occlusion
edge in which no matches can be found); occlusion
edges also define a natural distance progression in
an image even in the absence of stereo information.
If it is possible to assign labels to detected
edges describing their physical nature, then those
interpretation processes that use them can be made
much more robust.

We have implemented an approach to detecting
and identifying both shadow and occlusion edges,
based on the following general assumptions about
images of real scenes:

(1) The major portion of the area in an image
    (at some reasonable resolution for
    interpretation) represents continuous
    surfaces.

(2) Spatially separated parts of a scene are
    independent, and their image projections
    are therefore uncorrelated.

(3) Nature does not conspire to fool us; if
    some systematic effect is observed that
    we normally would anticipate as caused by
    an expected phenomenon due to imaging or
    lighting, then it is likely that our
    expectations provide the correct
    explanation; e.g., coherence in the image
    reflects real coherence in the scene,
    rather than a coincidence of the
    structure and alignment of distinct scene
    constituents.

Consider a curve overlaid on an image as
representing the location of a potential occlusion
edge in the scene. If we construct a series of
curves parallel to the given one, then we would
expect that for an occlusion edge, there would be a
high correlation between adjacent curves on both
sides of the given curve, but not across this
curve. That is, on each side, the surface
continuity assumption should produce the required
correlation, but across the reference curve the
assumption of remote parts of the scene being
independent should produce a low correlation score.
In a case where the reference curve overlays a
shadow edge, we would expect a continuous high
(normalized) correlation between adjacent curves on
both sides and across the reference curve, but the
regression coefficients should show a discontinuity
as we cross the reference curve. This technique is
described in greater detail in [14]. Figures 1 and
2 show experimental results for shadow and
occlusion edges.
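
The correlation test can be sketched as follows. This is a simplified
illustration rather than the procedure of [14]: it compares just one
pair of intensity profiles sampled along the two curves on either side
of the reference curve, and it treats "regression coefficients close to
gain 1 and offset 0" as the signature of an unbroken surface. The
thresholds, labels, and function names are assumptions.

```python
import math

def correlation_and_regression(u, v):
    """Normalized correlation of profiles u and v, plus the gain and
    offset of the regression v ~ gain*u + offset. Both profiles must
    vary (a constant profile has no defined correlation)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    r = suv / math.sqrt(suu * svv)
    gain = suv / suu
    return r, gain, mv - gain * mu

def classify_edge(before, after, min_corr=0.8, gain_tol=0.2, offset_tol=5.0):
    """Label the reference curve from the profiles on either side of it."""
    r, gain, offset = correlation_and_regression(before, after)
    if r < min_corr:
        return "occlusion"   # the two sides look like unrelated surfaces
    if abs(gain - 1.0) > gain_tol or abs(offset) > offset_tol:
        return "shadow"      # same surface, but the intensity mapping jumps
    return "continuous"      # no physical edge indicated

lit = [10, 12, 15, 11, 14, 13]
shadowed = [0.5 * v for v in lit]    # same texture at half the brightness
unrelated = [7, 3, 9, 14, 2, 8]      # a different surface behind the edge
labels = [classify_edge(lit, lit),
          classify_edge(lit, shadowed),
          classify_edge(lit, unrelated)]
```

A fuller version would track the correlation and the regression
coefficients over the whole family of parallel curves and look for the
discontinuity, rather than rely on fixed tolerances for a single pair.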


IV INTENSITY MODELING (and Material Classification)

Given that there is a reasonably consistent
transform between surface reflectance and image
intensity, the exact nature of this transform is
not required to recover rather extensive
information about the geometric configuration of
the scene. It is even reasonable to assume that
shadows and highlights can be detected without more
precise knowledge of the intensity mapping from
surface to image; but if we wish to recover
information about actual surface reflectance or
physical composition of the scene, then the problem
of intensity modeling must be addressed.


Even relatively simple intensity modeling must
address three issues: (1) the relationship between
the incident and reflected light from the surface
of an object in the scene as a function of the
material composition and orientation of the
surface; (2) the light that reaches the camera lens
from sources other than the surface being viewed
(e.g., light reflected from the atmosphere); and
(3) the relationship between the light reaching the
film surface and the intensity value ultimately
recorded in the digital image array.

Our approach to intensity modeling assumes
that we have no scene-specific information
available to us other than the image data. We use
a model of the imaging process that incorporates
our knowledge of the behavior of the recording
medium, the properties of atmospheric transmission,
and the reflective properties of the scene
materials. For aerial imagery we use an
atmospheric model that assumes that a constant
amount of light (independent of scene radiance) is
scattered by the atmosphere into the camera:

I = R + S

where I is the image irradiance, R the scene
radiance, and S the image irradiance caused by
atmospheric scattering. We use a logarithmic
relationship between image irradiance and film
density D,

D = a*log(I) + d

where a and d are film constants, whose values need
to be calculated. For a surface radiance model we
assume Lambertian behavior (the reflected light is
proportional to the incident light, the constant of
proportionality is a function of the surface
material, and the relative brightness of the
surface is independent of the location of the
viewer),

R = E*A*N

where E is the illumination strength (scene
irradiance), A the surface reflectance or albedo,
and N a function related to the effects of surface
orientation (for Lambertian surfaces this is a
function of the angle between the surface normal
and the light direction).

If for the present we ignore surface
orientation effects, that is, we assume all surfaces
are oriented in the same direction (so that N is a
constant that can be absorbed into E), then
substituting R = E*A and I = R + S into the density
equation gives D = a*log(E*A + S) + d, and our
model has the form

D = a*log(A+b) + c

where a, b, and c are constants that need to be
determined. Here b = S/E is the ratio of atmospheric
scattering to illumination irradiance, and
c = a*log(E) + d.

We calibrate our model by identifying a few
regions of known material in an image. Three
materials are sufficient. The fitting is achieved
by guessing b (we know b lies in the range 0 to 1),
applying the least-squares method to the
resultant linear equation to calculate a, c, and the
residual sum, and adjusting b to minimize this
residual sum.

The resultant model is used to transform the
given image into a new image depicting the scene
albedo. The albedo image has been used to provide
an initial classification (and partitioning) of the
scene using straightforward classification
techniques based on 'known' surface albedos. This
technique allows classification without the need to
provide training samples of all classes that are
present in the image.
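
The calibration loop and the albedo transform can be sketched as below.
The paper does not say how b is adjusted, so this sketch assumes a plain
grid search over the range 0 to 1; the function names, the grid
resolution, and the synthetic calibration data are likewise assumptions.
Four calibration regions are used in the example only to make the
synthetic fit well conditioned.

```python
import math

def fit_for_b(albedos, densities, b):
    """For a guessed b the model D = a*log(A+b) + c is linear in a and c,
    so solve for them by least squares and report the residual sum."""
    xs = [math.log(A + b) for A in albedos]
    n = len(xs)
    mx, md = sum(xs) / n, sum(densities) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxd = sum((x - mx) * (d - md) for x, d in zip(xs, densities))
    a = sxd / sxx
    c = md - a * mx
    residual = sum((d - (a * x + c)) ** 2 for x, d in zip(xs, densities))
    return a, c, residual

def calibrate(albedos, densities, steps=100):
    """Grid search over b in [0, 1]; keep the b with the smallest residual."""
    best = None
    for i in range(steps + 1):
        b = i / steps
        a, c, residual = fit_for_b(albedos, densities, b)
        if best is None or residual < best[3]:
            best = (a, b, c, residual)
    return best[:3]

def density_to_albedo(D, a, b, c):
    """Invert the calibrated model to turn a density into an albedo."""
    return math.exp((D - c) / a) - b

# Synthetic calibration regions generated from known constants.
TRUE_A, TRUE_B, TRUE_C = 1.5, 0.2, 0.3
known_albedos = [0.05, 0.2, 0.5, 0.8]
measured = [TRUE_A * math.log(A + TRUE_B) + TRUE_C for A in known_albedos]

a, b, c = calibrate(known_albedos, measured)
recovered = density_to_albedo(measured[2], a, b, c)
```

Applying `density_to_albedo` to every pixel would produce the albedo
image that the classification step then works from.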


V  SHADOW  DETECTION  (and  Raised  Object  Cueing) 

The ability to detect and properly identify
shadows is a major asset in scene analysis. For
certain types of features, such as thin raised
objects in a vertical aerial image, it is often the
case that only the shadow is visible. Knowledge of
the sun's location and shadow dimensions frequently
allows us to recover geometric information about
the 3-D structure of the objects casting the
shadows, even in the absence of stereo data [8,10];
but perhaps just as important, distinguishing
shadows from other intensity variations eliminates
a major source of confusion in the interpretation
process.

Given an intensity discontinuity in an image,
we can employ the edge labeling technique described
earlier to determine if it is a shadow edge.
However, some thin shadow edges are difficult to
find, and if there are many edges, we might not
want to have to test all of them to locate the
shadows. We have developed a number of techniques
for locating shadow edges directly, and will now
describe a simple but effective method for finding
the shadows cast by thin raised objects (and thus
locating the objects as well).

We assume we either know the approximate sun
direction, or equivalently, the shadow vanishing
point. We first employ a thin line detector
oriented parallel to the sun direction at every
location in the image, and then apply a
moving-window averaging technique in the sun's direction
to further enhance the line detector's response and
reduce noise. The result of these operations is to
smear both the noise and the thin shadow lines. We
next thin the shadow lines, eliminate all weak
responses, and overlay the result on the original
image. The foot of each shadow line now points to
the base of the thin raised object casting the
shadow. Given the results from two (or more)
images taken at different times, the intersections
of shadow lines locate the objects more precisely
and also eliminate false alarms.
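
The first stages of this pipeline (oriented line detection, averaging
along the sun direction, and rejection of weak responses) can be
sketched as follows. To keep the example short it assumes the sun
direction is aligned with the image rows, uses a three-pixel detector
that is only a stand-in for the detector of Figure 3, and omits the line
thinning step; the window size, threshold, and function names are
assumptions.

```python
def thin_dark_line_response(image):
    """Respond where a pixel is darker than the pixels directly above and
    below it, i.e. to a thin dark line running along the row direction."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(cols):
            flank = (image[r - 1][c] + image[r + 1][c]) / 2.0
            out[r][c] = max(0.0, flank - image[r][c])
    return out

def average_along_rows(response, half_width):
    """Moving-window average taken along the (assumed) sun direction."""
    cols = len(response[0])
    out = []
    for row in response:
        smoothed = []
        for c in range(cols):
            window = row[max(0, c - half_width):min(cols, c + half_width + 1)]
            smoothed.append(sum(window) / len(window))
        out.append(smoothed)
    return out

def detect_shadow_pixels(image, half_width=1, threshold=50.0):
    """Return the set of (row, col) pixels with a strong averaged response."""
    smoothed = average_along_rows(thin_dark_line_response(image), half_width)
    return {(r, c) for r, row in enumerate(smoothed)
            for c, value in enumerate(row) if value >= threshold}

# Synthetic 7x9 image: bright ground with one thin dark shadow on row 3.
IMAGE = [[100] * 9 for _ in range(7)]
for col in range(2, 7):
    IMAGE[3][col] = 40

detected = detect_shadow_pixels(IMAGE)
```

In this toy case the averaging trims one pixel from each end of the
shadow line, which is the price paid for suppressing isolated noisy
responses.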

The  same  technique  has  been  applied  to  the 
detection  of  raised  objects  of  extended  size. 
Shadow  edges  of  the  extended  object  are  detected 
and  used  to  locate  the  object.  Figures  3-10  show 
this  approach  to  detecting  both  thin  and  extended 
raised  objects. 


VI  VISUAL  SKYLINE  DELINEATION 

Although not always a well defined problem,
delineation of the land-sky boundary provides
important constraining information for further
analysis of the image. Its very existence in an
image tells us something about the location of the
camera relative to the scene (i.e., that the scene
is being viewed at a high-oblique angle), allows us
to estimate visibility (i.e., how far we can see,
both as a function of atmospheric viewing
conditions, and as a function of the scene
content), provides a source of good landmarks for
(autonomous) navigation, and defines the boundary
beyond which the image no longer depicts portions
of the scene having fixed geometric structure.

In our analysis, we generally assume that we
have a single right-side-up image in which a
(remote) skyline is present. Confusing factors
include clouds, haze, snow-covered land structures,
close-in raised objects, and bright buildings or
rocks that have intensity values identical to those
of the sky (a casual inspection of an image will
often provide a misleading opinion about the
difficulty of skyline delineation for the given
case). Our initial approach to this problem was to
investigate the use of slightly modified methods
for linear delineation [4] and histogram
partitioning based on intensity and texture
measures; we employ fairly simple models of the
relationship between land, sky, and cloud
brightness and texture.

Currently, we are employing a region-based
technique, which operates as follows.

To eliminate spurious regions and gaps in
region boundaries caused by noise, we first reduce
the given image by a factor of at least 4. We
partition the image into a nested pyramid of
regions, each region being one in which every pixel
has an intensity value that differs by less than
some given threshold from at least one of its
4-neighbors. The nested pyramid is constructed by
using a sequence of increasing threshold values
(e.g., 2, 4, 8, ...); thus if T1 and T2 are thresholds
such that T1 < T2, then any region found with
threshold T1 is necessarily identical to a region,
or to a subregion of a region, found with
threshold T2.
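
The nesting property can be checked on a small example. The sketch
below labels regions by breadth-first flood fill over 4-neighbors, which
is one straightforward reading of the region definition above and not
necessarily how the authors implemented it; the function names and the
toy image are hypothetical.

```python
from collections import deque

def label_regions(image, threshold):
    """Label 4-connected regions in which neighboring pixels differ by
    less than `threshold`. Returns a grid of integer region labels."""
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if labels[r0][c0] != -1:
                continue  # already absorbed into an earlier region
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] == -1
                            and abs(image[nr][nc] - image[r][c]) < threshold):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
    return labels

def count_regions(labels):
    return len({label for row in labels for label in row})

def is_nested(fine, coarse):
    """True if every region of `fine` lies inside one region of `coarse`."""
    parent = {}
    for fine_row, coarse_row in zip(fine, coarse):
        for f, c in zip(fine_row, coarse_row):
            if parent.setdefault(f, c) != c:
                return False
    return True

IMAGE = [[10, 11, 12, 20],
         [10, 11, 13, 21],
         [30, 30, 14, 22]]

level_2 = label_regions(IMAGE, 2)     # three regions
level_16 = label_regions(IMAGE, 16)   # two of them have merged
```

With a threshold of 2 a region may still contain a slow ramp (10, 11,
12, 13, 14 above), since only adjacent pixels are compared.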

A "sky seed" is found by identifying the
region that dominates the very top of the picture
with a segmentation threshold of 2 (this is the
lowest threshold that allows a gradient to exist
within a region). For a clear sky, or a sky with
cumulus clouds completely surrounded by clear sky,
this step usually identifies the entire sky.
Figure 11 shows an urban scene with overcast sky
and Figure 12 shows the same scene with the sky
seed overlaid.

As an additional piece of information, the sky
seed is classified as clear sky, overcast sky, or
patchy clouds. Patchy cumulus clouds appear as
large bright regions within a clear sky region,
while the brightness function for a clear sky can
be modeled as a linear function of the image
coordinates. Although the equations governing
clear sky luminance are complex
integro-differential equations, it was determined
empirically that for the viewing angles produced by
a 50mm lens, a (linear) planar model provided a
good fit. To determine whether the sky seed is
clear sky or overcast sky, a least-squares fit to
the planar model is made, and the mean square
error, corrected by the measured intensity
variance, is compared to a fixed threshold. The
classification of the sky into
clear/overcast/patchy clouds can help to resolve
some of the confusing factors in skyline detection,
but this information is not currently used.
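
The planar fit can be sketched as an ordinary least-squares problem in
three unknowns, solved here through the normal equations and Cramer's
rule. The sketch stops at the mean square error: the paper does not give
the form of the variance correction or the value of the threshold, so
both are left out. The function names and sample data are assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(samples):
    """Least-squares fit of intensity = p*x + q*y + r to (x, y, intensity)
    samples. Returns ((p, q, r), mean_square_error)."""
    n = len(samples)
    sxx = sum(x * x for x, _, _ in samples)
    sxy = sum(x * y for x, y, _ in samples)
    syy = sum(y * y for _, y, _ in samples)
    sx = sum(x for x, _, _ in samples)
    sy = sum(y for _, y, _ in samples)
    sxi = sum(x * i for x, _, i in samples)
    syi = sum(y * i for _, y, i in samples)
    si = sum(i for _, _, i in samples)
    matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxi, syi, si]
    d = det3(matrix)
    coeffs = []
    for k in range(3):
        # Cramer's rule: replace column k by the right-hand side.
        mk = [row[:] for row in matrix]
        for row_index in range(3):
            mk[row_index][k] = rhs[row_index]
        coeffs.append(det3(mk) / d)
    p, q, r = coeffs
    mse = sum((i - (p * x + q * y + r)) ** 2 for x, y, i in samples) / n
    return (p, q, r), mse

grid = [(x, y) for x in range(4) for y in range(4)]
# A brightness ramp that is exactly planar, as assumed for clear sky.
ramp = [(x, y, 2 * x + 3 * y + 50) for x, y in grid]
# A curved brightness profile that no plane can follow.
curved = [(x, y, x * x) for x, y in grid]

ramp_plane, ramp_error = fit_plane(ramp)
curved_plane, curved_error = fit_plane(curved)
```

A small error is consistent with the clear-sky planar model, while a
large one indicates that the seed is not well described by a plane.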

Next, a line spanning the picture from right
to left is found that is either at or below the
true skyline; this line is found by doubling the
threshold for segmentation until the region
containing the sky seed touches the bottom of the
picture. Since we make an initial assumption that
the sky does not touch the bottom of the picture,
this threshold is then backed off by a factor of 2
and a "land seed" is defined as the complement of
the sky region. The assumption here is that the
skyline, or some boundary in the land region, is of
higher contrast than any extended boundary within
the sky. Among the 15 test pictures that we employed
in our experiments, this assumption was violated
only once (where a particularly bright cumulus
cloud on the horizon formed a brighter boundary
with the sky than a bright rock on the horizon;
such a situation can be easily detected after
initial processing). Figure 13 shows the case in
which a region containing the sky seed touches the
bottom of the picture at a threshold of 16, and
Figure 14 shows the picture split into a sky seed,
land seed, and ambiguous unclassified portion. The
land seed is determined by using a threshold of 8.
Figure 15 shows an additional and more typical
example of skyline delineation.
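
The threshold-doubling step can be sketched as below. The sketch grows
the sky region from a single seed pixel at the top of the picture, which
simplifies the paper's "region that dominates the very top of the
picture"; the function names, the threshold cap, and the toy image are
assumptions.

```python
from collections import deque

def grow_region(image, seed, threshold):
    """Set of pixels 4-connected to `seed` through steps whose intensity
    difference is less than `threshold`."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and abs(image[nr][nc] - image[r][c]) < threshold):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

def split_sky_and_land(image, seed, start_threshold=2, max_threshold=256):
    """Double the threshold until the sky region reaches the bottom row,
    back off by a factor of 2, and take the complement as the land seed."""
    bottom = len(image) - 1
    threshold = start_threshold
    while threshold <= max_threshold:
        region = grow_region(image, seed, threshold)
        if any(r == bottom for r, _ in region):
            break
        threshold *= 2
    else:
        return None  # the sky region never reached the bottom row
    backed_off = threshold // 2
    sky_seed = grow_region(image, seed, start_threshold)
    sky_region = grow_region(image, seed, backed_off)
    everything = {(r, c) for r in range(len(image))
                  for c in range(len(image[0]))}
    land_seed = everything - sky_region
    unclassified = sky_region - sky_seed
    return backed_off, sky_seed, land_seed, unclassified

# Toy picture: clear sky (rows 0-1), a haze band (row 2), land (rows 3-4).
IMAGE = [[100, 100, 100],
         [101, 101, 101],
         [95, 95, 95],
         [60, 60, 60],
         [58, 58, 58]]

threshold, sky_seed, land_seed, unclassified = split_sky_and_land(IMAGE, (0, 0))
```

Here the region first reaches the bottom row at a threshold of 64, so
the land seed is taken at 32, and the haze band is left as the
unclassified strip between the two seeds.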

In  a  substantial  number  of  pictures  the  sky 
and  land  seeds  touch,  thereby  delineating  the 
skyline.  If  the  sky  and  land  seeds  do  not  have  a 
common  boundary,  a  portion  of  the  picture  is  left 
unclassified,  bounded  by  the  sky  seed  above  and  the 
land  region  below.  Current  work  focuses  on 
developing  methods  to  disambiguate  the  unclassified 
portion  of  the  picture.  The  methods  under 
development  are  generic  to  all  types  of  scenes  and 
our  approach  does  not  use  semantic  knowledge  of 
particular  land  features.  Prior  work  on  this 
topic,  employing  considerable  semantic  knowledge, 
is  contained  in  Sloan  [11]. 


VII  SURFACE  MODELING 

Obtaining a detailed representation of the
visible surfaces of the scene, as (say) a set of
point arrays depicting surface orientation, depth,
reflectance, material composition, etc., is
possible from even a single black and white image
[12,2]. A large body of work now exists on this
topic (see [15,16] for recent work by our group),
and although directly relevant to our efforts, it
is not practical to attempt a discussion of this
material here. There is, however, one key
difference between surface modeling and the other
topics we have discussed: the extent to which the
particular physical knowledge modeled constrains
the analysis of other parts of the scene. In this
paper we have been primarily concerned with
physical models that provide global or extended
constraints on the analysis; surface modeling via
point arrays provides a very localized constraining
influence.


VIII  CONSTRAINT-BASED  STEREO  COMPILATION 

The computational stereo paradigm encompasses
many of the important task domains currently being
addressed by the machine-vision research community
[1]; it is also the key to an application area of
significant commercial and military importance:
automated stereo compilation. Conventional
approaches to stereo compilation, based on finding
dense matches in a stereo image pair by area
correlation, fail to provide acceptable performance
in the presence of the following conditions
typically encountered in mapping cultural or urban
sites: widely separated views (in space or time),
wide angle views, oblique views, occlusions,
featureless areas, repeated or periodic structures.
As an integrative focus for our research, and
because of its potential to deal with the factors
that cause failure in the conventional approach, we
are constructing a constraint-based stereo system
that encompasses many of the physical modeling
techniques discussed above.

Figure 16 shows how a stereo system can exploit
global geometric constraints. First, straight
lines and vanishing points are found in the two
stereo images as described earlier (see Section
II). Lines are first classified according to which
vanishing point they pass through. Those lines not
associated with the detected vanishing points are
ignored. The vanishing points in the two stereo
views are then matched. The direction in space
established by a vanishing point is a feature of
the scene which is invariant under translation of
the camera. Two matches of vanishing points are
sufficient to calculate the rotational differences
between the cameras, i.e., the rotation required to
bring one camera's vanishing points into congruence
with the other's. Two matches of ordinary points
are now sufficient to determine the translation of
one camera with respect to the other (up to an
unknown scaling factor).
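
The rotation step can be sketched with a standard construction: build an
orthonormal frame from the two vanishing directions seen by each camera,
then take the rotation that carries one frame onto the other. This is an
illustration, not the authors' method. It assumes the focal length is
known, so that a vanishing point (x, y) gives the direction (x, y, f),
and that the sign ambiguity of each direction has already been resolved;
all function names are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def direction_from_vanishing_point(point, focal_length):
    """Unit direction in camera coordinates for an image vanishing point."""
    return normalize((point[0], point[1], focal_length))

def frame(d1, d2):
    """Right-handed orthonormal frame built from two non-parallel
    directions."""
    a1 = normalize(d1)
    a2 = normalize(cross(d1, d2))
    return (a1, a2, cross(a1, a2))

def rotation_from_directions(d1, d2, e1, e2):
    """Rotation matrix R (nested lists) with R*d1 = e1 and, when the pairs
    are consistent, R*d2 = e2. d1, d2 are directions seen by camera A;
    e1, e2 are the matching directions seen by camera B."""
    A, B = frame(d1, d2), frame(e1, e2)
    return [[sum(B[k][i] * A[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotate(R, v):
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))

# Camera B is camera A turned by 90 degrees about the optical (z) axis.
R = rotation_from_directions((1, 0, 0), (0, 1, 0), (0, 1, 0), (-1, 0, 0))
turned = rotate(R, (1, 0, 0))
axis = rotate(R, (0, 0, 1))
```

Two matched directions fix the rotation completely because a third,
perpendicular direction follows from their cross product.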

Using vanishing points can improve stereo
matching even when the exact camera model is
unknown. In Figures 16i and 16j, lines passing
through a vanishing point in one image are first
matched to the set of lines passing through the
corresponding vanishing point in the other image.
For example, right-image lines passing through the
vertical vanishing point are only matched to
left-image lines that also pass through the vertical
vanishing point. Within these subsets, lines are
matched according to a score based on four
features: (1) difference in distance from the
vanishing point to the lines, (2) ratio of lengths,
(3) difference in contrast, and (4) difference in
phase, i.e., the angle the line makes with the
image-horizontal. Each subscore is a value in the
interval (0,1). The value represents the
likelihood of this combination of the four
features. The subscores are combined
multiplicatively, and the combination with the
maximum score (above a preset threshold) is chosen.
Even this simple matching technique, using no
search or relaxation, finds an adequate number of
correct matches.
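
The scoring scheme can be sketched as below. The paper does not say how
each feature difference is turned into a subscore, so the sketch assumes
an exponential falloff with a per-feature scale; that choice, the scale
values, the dictionary keys, and the function names are all hypothetical.

```python
import math

def subscore(difference, scale):
    """Map a feature difference to a value in (0, 1]; 1 means identical.
    The exponential form is an assumption made for this sketch."""
    return math.exp(-abs(difference) / scale)

def match_score(a, b, scales):
    """Product of the four subscores for a candidate pair of lines."""
    score = subscore(a["vp_distance"] - b["vp_distance"],
                     scales["vp_distance"])
    score *= subscore(math.log(a["length"] / b["length"]),
                      scales["log_length_ratio"])
    score *= subscore(a["contrast"] - b["contrast"], scales["contrast"])
    score *= subscore(a["phase"] - b["phase"], scales["phase"])
    return score

def best_match(line, candidates, scales, threshold):
    """Highest-scoring candidate, or None if none reaches the threshold."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = match_score(line, candidate, scales)
        if score >= best_score:
            best, best_score = candidate, score
    return best

SCALES = {"vp_distance": 5.0, "log_length_ratio": 0.5,
          "contrast": 20.0, "phase": 0.2}

left = {"vp_distance": 10.0, "length": 40.0, "contrast": 30.0, "phase": 1.50}
similar = {"vp_distance": 10.5, "length": 38.0, "contrast": 28.0, "phase": 1.52}
different = {"vp_distance": 25.0, "length": 15.0, "contrast": 60.0, "phase": 1.20}

chosen = best_match(left, [different, similar], SCALES, threshold=0.5)
rejected = best_match(left, [different], SCALES, threshold=0.5)
```

Because the subscores multiply, a poor agreement on any single feature
is enough to push a candidate below the acceptance threshold.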


IX  CONCLUDING  COMMENTS 

When a person views a scene, he has an
appreciation of where he is relative to the scene,
which way is up, the general geometric
configuration of the surfaces (especially the
support and barrier surfaces), and the overall
semantic context of the scene. The research effort
we have described is intended to provide similar
information to constrain the more detailed
interpretation requirements of machine vision
(e.g., such tasks as stereo compilation and image
matching).
FIGURE 1 EXAMPLE OF CAST-SHADOW EDGE

FIGURE 2 EXAMPLE OF EXTREMAL EDGE

FIGURE 3 THIN SHADOW LINE DETECTOR

FIGURE 4 ORIGINAL IMAGE

FIGURE 5 RESULTS OF APPLYING THE LINE
    DETECTOR TO ORIGINAL IMAGE

FIGURE 6 NOISE REMOVAL USING MOVING
    WINDOW INTEGRATION ALONG
    SHADOW DIRECTION

FIGURE 7 LINE THINNING

FIGURE 8 THRESHOLDED LINES OVERLAYED
    ON ORIGINAL IMAGE: COMPARE
    WITH FIGURE 4
    (SHADOW DETECTION)

FIGURE 9 RESULTS USING TWO ADDITIONAL
    IMAGES: COMPARE WITH FIGURE 10
    (SHADOW DETECTION)

FIGURE 10 RESULTS FOR DETECTING
    EXTENDED OBJECTS

FIGURE 11 URBAN SCENE WITH OVERCAST SKY
    (SKYLINE DELINEATION)

FIGURE 12 SKY SEED FOUND WITH REGION
    SEGMENTATION AT THRESHOLD 2
    (SKYLINE DELINEATION)

FIGURE 13 THRESHOLD IS DOUBLED UNTIL REGION
    CONTAINING SKY SEED TOUCHES "BOTTOM"
    15% OF PICTURE
    (SKYLINE DELINEATION)

FIGURE 14 PICTURE SEGMENTED INTO SKY SEED,
    UNCLASSIFIED PORTION, AND LAND SEED
    (SKYLINE DELINEATION)

FIGURE 16 STEREO MATCHING USING GLOBAL
    PERSPECTIVE CONSTRAINTS
    (e) Gaussian Mapping and Vanishing Points
    (g) Parallel lines


SYMBOLIC MATCHING OF IMAGES AND SCENE MODELS


Keith  E.  Price 
Los Angeles, California 90089-0272


Abstract 


In this paper we explore the application of a
general relaxation-based matching procedure to the
problem of matching pairs of images and the
extension of basic techniques to matching
hierarchical descriptions of scenes. These problems
require extensions to the general graph matching
procedure that account for the special properties
of the given task. The efficiency of the matching
program is greatly improved by including these
changes.

I.  Introduction' 

Matching of images and descriptions has many
different uses and can be performed at several
different levels. Some matching tasks require that
very precise corresponding locations be computed
(e.g., stereo depth computation, pixel level change
detection). But for many tasks, matching at a
grosser level (i.e., finding correspondences between
large areas) is best. This paper discusses results
of a general symbolic level image matching system
applied to two tasks: matching an image and an a
priori description of the scene (a model) to locate
objects in the image, and matching two images to
find the location of an object in two different
views. Thus we use this program to find
correspondences between areas of the images (or
objects) rather than to find a pixel level mapping
between them.

II.  Background 

The work reported here represents an extension
of earlier relaxation based symbolic matching
efforts [1]. A variety of image matching techniques
have been developed for different tasks. Moravec
[2] has developed a system which locates feature
points in one image (essentially corners) and uses a
correlation based matching procedure at multiple
resolutions to efficiently find a set of
corresponding points in the two images. This system
is intended for land based robot navigation which
uses the three dimensional information from these
feature points for navigation. A stereo system
developed by Baker [3] generates a complete
disparity map starting from edge correspondences
which can be used for depth computations if the
camera positions are known. These two (and many
other similar efforts) concentrate on precise
matching of image data.


Several systems which work on a variety of
symbolic representations have also been developed.
Barnard and Thompson [4] have developed a relaxation
based motion analysis program which finds
corresponding feature points in two images. The
feature points are similar to those of Moravec [2],
but they are located in both images. Wong et al.
[5] also use a relaxation procedure to match corners
which are detected in pairs of images. This system
allows arbitrary translations and rotations of the
camera. Clark et al. [6] have developed a system
to match line like structures (generally edges or
region boundaries). The program uses three initial
matching lines to get a mapping between the two
images. The quality of the match depends on how
well all the other lines match, and the best match
is determined by trying all possible triples of
matching lines. The number of possible triples is
limited by the allowable transformations, i.e.,
given one match, the possible matches for the other
two are very restricted. Gennery [7] extracts
objects and uses a tree searching procedure to find
the best match.

The relaxation procedure used here is developed
more fully in [1, 8] and differs from other methods
in its gradient optimization approach. There are
several alternative relaxation updating schemes such
as the basic method of Rosenfeld et al. [9], Peleg
[10], and Hummel and Zucker [11]. We have
implemented these other methods and use them for a
comparison of the results.

III. Description and Matching

This matching system uses feature-based
symbolic descriptions for its input. The
description of an idealized version of the scene (a
model) is developed by the user through an
interactive procedure. The image descriptions are
derived automatically from input images. The
underlying descriptive mechanism is a semantic
network. The nodes of the network are the basic
objects with associated feature values and the links
indicate the relations between objects.

The basic objects used in the image description
are regions or linear features extracted by
automatic segmentation procedures [12, 13]. These
procedures produce a set of objects composed of
connected regions which are homogeneous with respect
to some feature in the input image and long narrow
objects which differ from the background on both
sides and can be represented as a sequence of
straight line segments. Only the important objects
are described in the model. The automatic image
segmentation produces many objects which are not
included in the model (as many as 100-300 elements).
The model description determines the outcome of the
matching procedure and can also be used to guide the
segmentation procedure [14].

The description is completed by extracting
features of the regions and linear objects. The
features are those which can be easily computed from
the data and which are reasonably consistent. These
features include average values of the image
parameters (intensity, colors, etc.), size,
location, texture, and simple shape measures (length
to width ratio, fraction of minimum bounding
rectangle filled by the object, perimeter^2/area,
etc.). Relations included in the description are
those which are easily computed, such as adjacency,
relative position (north of, east of, etc.), near
by, and an explicit indication of not near by.

The  basic  goal  for  the  matching  procedure  is  to 
determine  which  elements  in  the  image  correspond  to 
the  given  objects  in  the  model.  Most  of  the  objects 
cannot  be  recognized  based  on  features  alone.  They 
require  contextual  information  to  be  accurately 
located.  An  important  idea  used  by  the  matching 
system  is  to  locate  a  small  set  of  corresponding 
objects  using  feature  values  and  weak  contextual 
information.  These  initial  islands  of  confidence 
provide  the  context  needed  for  finding 
correspondences  for  the  less  well  defined  objects. 
Finally,  when  most  objects  are  assigned,  the 
matching  can  be  done  solely  on  the  basis  of  context, 
i.e.,  radical  changes  in  a  few  objects  do  not  cause 
the  matching  program  to  fail. 

The basic operation of the matching system is
outlined in Fig. 1. In the large outer loop a set
of possible matching regions is located for every
element in the model. Each of these possible
assignments has a rating (probability) based on how
well the model and image elements correspond. These
ratings are refined by the relaxation procedure in
the inside loop, until one or more model elements
have one highly likely assignment (usually a
probability greater than 0.7 or 0.8). At this point
a firm assignment is made and the likely assignments
are recomputed using these assigned elements to give
the context for the match. The inner relaxation
procedure updates the probabilities of the
assignment based on how compatible the assignment is
with the assignments of its neighbors in the graph
(i.e., objects linked by relations). We use a
variety of relaxation schemes [1, 8, 9, 10, 11, 15]
in this loop, with the criteria optimizing method in
[1, 8] giving the best results.
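
The two nested loops described above can be sketched in Python as follows. This is only an illustrative outline under stated assumptions, not the program used in the paper: the names `match_model`, `initial_probs`, and `relax_step` are hypothetical, the candidate ratings are held in plain dictionaries, and the 0.7 threshold is the lower of the two values quoted in the text.

```python
def match_model(model, initial_probs, relax_step, threshold=0.7, max_inner=10):
    """Return {model element: image element} for all firm assignments.

    initial_probs(assigned) -> {u: {n: probability}} for unassigned u
        (hypothetical callback: rates candidates using the current context)
    relax_step(probs, assigned) -> refined ratings
        (hypothetical callback: one relaxation update)
    """
    assigned = {}
    while len(assigned) < len(model):
        # Outer loop: recompute the likely assignments with the new context.
        probs = initial_probs(assigned)
        firm = {}
        for _ in range(max_inner + 1):
            # An element becomes firm once one candidate is highly likely.
            firm = {u: max(p, key=p.get)
                    for u, p in probs.items()
                    if u not in assigned and p and max(p.values()) > threshold}
            if firm:
                break
            # Inner loop: refine the ratings by one relaxation iteration.
            probs = relax_step(probs, assigned)
        if not firm:
            break  # nothing became likely enough; return the partial match
        assigned.update(firm)
    return assigned
```

In this sketch an element that is firmly assigned is simply dropped from later rating passes; the paper instead keeps assigned elements in the process so that later iterations can correct errors.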

Matching  Details 

The quality of match between two elements (one
each from the model and image or from two different
images) is given by the weighted sum of the
magnitude of the feature value differences,

$$R(u, n) = \sum_{t=1}^{m} |V_{ut} - V_{nt}| \, W_t S_t \qquad (1)$$

where $u$ is an element from the model, $n$ is an
element from the image, $m$ is the number of features
being considered, and $V_{ut}$ ($V_{nt}$) is the value
of the $t$-th feature of element $u$ ($n$). $W_t$ is a
normalization weight (the same for all tasks) to
equalize the impact from all features. $S_t$ is the
task dependent strength of a given feature. These
strength values distinguish between important,
average, and unimportant features. The ratio of the
strength values is 5:1 and there is a fourth strength
zero which indicates a feature is not used. This
rating function is converted to the range [0, 1] by

$$f(u, n) = \frac{a}{R(u, n) + a} \qquad (2)$$


where $a$ is a constant which controls how steep the
difference function is. A value of 1 (a sharply
declining function) produces the best results with
the optimization updating approach. Relations are
easily included in this scheme. $V_{ut}$ is the number
of relations of type $t$ which are specified in the
model and $V_{nt}$ is the number which actually occur in
the image. Figure 4 illustrates how these values
are computed for a given $u_i$. For each possible
corresponding region $n_k$, check all $u_j$ (in the model)
which are related to $u_i$ to see if the given
correspondence ($n_l$) for $u_j$ is properly related to
$n_k$. When computing the initial probabilities of a
match, only those $u_j$ which have been previously
assigned can be considered. The basic compatibility
measure is computed when given two potential
assignments; therefore, rather than using all the
other units in the model, use only the specified unit
$u_j$.
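
As a small worked illustration of the rating in Eq. 1 and its conversion in Eq. 2, the following Python sketch computes both for one model element and one image element. The function names and the sample feature values, weights, and strengths are assumptions made for the example, not values from the paper.

```python
def rating(model_vals, image_vals, weights, strengths):
    """Eq. 1: weighted sum of absolute feature differences (0 is a perfect match)."""
    return sum(abs(vu - vn) * w * s
               for vu, vn, w, s in zip(model_vals, image_vals, weights, strengths))


def similarity(r, a=1.0):
    """Eq. 2: map a rating r >= 0 to a value that is 1 at r = 0 and falls toward 0."""
    return a / (r + a)


# Two features: the first is "important" (strength 5), the second "unimportant" (strength 1).
r = rating([1.0, 4.0], [1.5, 2.0], weights=[2.0, 0.5], strengths=[5.0, 1.0])
f = similarity(r)  # a = 1 gives the sharply declining function mentioned in the text
```

Setting a strength to zero removes that feature from the sum, which corresponds to the "not used" case.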


The relaxation procedures require a function
which measures the compatibility of a particular
assignment $n_k$ with the current assignments at all
neighboring (related) units. This is defined by

$$Q_i(n_k) = \sum_{u_j \in N_i} Q_{ij}(n_k) + |N_i|^{\alpha} f(u_i, n_k) \, p_i(n_k) \qquad (3)$$

and

$$Q_{ij}(n_k) = \sum_{n_l \in W_j} c(u_i, n_k, u_j, n_l) \, p_j(n_l) \qquad (4)$$

where $N_i$ is the set of objects related to $u_i$, $|N_i|$
is the number of neighbors, $\alpha$ is a factor between 0
and 1 that adjusts the relative importance of
features versus relations (0.1 to 0.25 is the usual
range), $p_i(n_k)$ is the current probability for
assigning $u_i$ to $n_k$, $W_j$ is the set of likely
assignments of $u_j$ (for efficiency and improved
results we generally use only the one most likely
assignment here), and $c(u_i, n_k, u_j, n_l)$ is the same
as $f(u_i, n_k)$ except that only relations between $u_i$
and $u_j$ are considered. The vector $Q_i$ is normalized
to give a probability vector $q_i$ which is used by the
updating step. The iterative updating is given by

$$p_i^{(n+1)} = p_i^{(n)} + \rho_n P_i\{g_i^{(n)}\}, \qquad n = 0, 1, 2, \ldots \qquad (5)$$

where $\rho_n$ is a positive step size to control the
convergence speed, $P_i$ is a linear projection
operator to maintain the constraint on $p_i^{(n+1)}$ that
it is a probability vector, and $g_i$ is an explicit
gradient function determined by the criteria to be
optimized.
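
The support computation and projected update described above can be sketched with NumPy as follows. This is a minimal sketch under several assumptions that are not from the paper: probabilities are stored as an array `p[i, k]`, the compatibilities as a dense array `c[i, k, j, l]`, each $W_j$ is reduced to the single most likely assignment (as the text says is usually done), every element has at least one neighbor, and the projection is approximated by removing the mean of the step, clipping at zero, and renormalizing. The gradient `g` is taken as an input.

```python
import numpy as np


def support(p, f, c, neighbors, alpha=0.2):
    """Unnormalized support: feature term plus neighbor compatibilities."""
    num_units, num_labels = p.shape
    Q = np.zeros((num_units, num_labels))
    for i in range(num_units):
        # Feature term, weighted by |N_i|**alpha (assumes N_i is not empty).
        Q[i] = len(neighbors[i]) ** alpha * f[i] * p[i]
        for j in neighbors[i]:
            l = int(np.argmax(p[j]))  # W_j reduced to the most likely label
            Q[i] += c[i, :, j, l] * p[j, l]
    return Q


def normalize(Q):
    """Scale each row to sum to one, giving the probability vectors q_i."""
    return Q / Q.sum(axis=1, keepdims=True)


def update(p, g, rho=0.5):
    """One step p + rho * P{g}, with a simplified projection (an assumption)."""
    step = g - g.mean(axis=1, keepdims=True)   # keep each row summing to one
    p_new = np.clip(p + rho * step, 0.0, None)  # no negative probabilities
    return p_new / p_new.sum(axis=1, keepdims=True)
```

Passing the normalized support itself as the direction, `update(p, normalize(support(p, f, c, neighbors)))`, gives a projection-only style of update; the optimization approach would pass the gradient of the chosen criterion instead.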

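$$g_i(n_k) = q_i(n_k) + p_i(n_k) \, |N_i|^{\alpha} f(u_i, n_k) \, [1 - q_i(n_k)] / D_i + \sum_{u_j \,:\, \exists\, n_l \in W_j} \; \sum_{n_l \in W_j} c(u_j, n_l, u_i, n_k) \, (p_j(n_l) - p_j \cdot q_j) \qquad (6)$$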

where

$$D_i = \sum_{k=1}^{m} Q_i(n_k) \qquad (7)$$

where $m$ is the number of possible assignments.


IV. Extensions to the Matching

This basic matching procedure is able to
adequately perform the match for many tasks, but
there are extensions which are required for others.
These include extensions to handle multiple levels
of descriptions for the scene and those to
facilitate the image to image matching process.


A. Groups

The matching procedure, as described so far,
handles relations between two specific elements. If
relations among three or more objects are desired
they are specified by combinations of binary
relations. They may not yield unique matches, but
explicit higher order relations are too expensive to
compute and use. We extend the matching and
description system to include relations between
groups of elements. These groups are specified in
the model and can be composed of an object or a
collection of separate objects that can be more
easily related to others as a group. For example,
in Fig. 2, the area of San Francisco can be
considered as a group composed of the urbanized
area and the park-like areas. The bay, bridges,
and islands can form another group. The separate
clusters of storage tanks or buildings could be used
to form groups in Fig. 3.

We make several assumptions about the groups of
objects. (1) The components of a group are
spatially close, not widely scattered through the
image. (2) Relations (adjacency, above, etc.)
between elements within a group are meaningful, but
usually not between individual elements in two
separate groups. (3) Relations between groups are
consistent and predictable. (4) The feature values
for individual objects relative to the averages for
the group are well defined (e.g., intensity greater
than average, x location in the top fifth). This
easily handles structures in aerial images and might
be extended to three-dimensional structures possibly
with some changes in assumptions.

Thus we simply extend the basic network
description of the model to include for each element
a pointer to the group, and feature values relative
to the group average values, with relations between
the groups specified as links between the group
nodes. Group features are not available for the
image description until the correspondences are
located. Initial groupings could be computed in
limited cases by creating sets of objects where each
is near at least one other member of the set. We
could consider groups as descriptions at higher
levels in a generalized pyramid structure [16], but
our description of the higher level object is based
solely on the lower level descriptions of its parts
rather than the description of the object at lower
resolutions. For a description which encompasses
more than two levels, a general multilevel
description should be used, but a matching scheme
would require a means for linkage between levels.


These group features and relations are
incorporated into the matching procedure much the
same as the initial features and relations
(Eqs. 1-6). But we apply a second relaxation step
in the inner loop (see Fig. 1) using only the group
features and relations to compute the compatibility
measures. The average feature values and the
location of each group are computed from the current
most likely assignment for each of the components of
the group (i.e., the top one after the previous
relation updating step). Figures 4 and 5 illustrate
how group relations enter in compatibilities. As
illustrated in Fig. 4, the measure for a standard
relation is given by whether the relation specified
in the model between two elements actually occurs
between the two possible assignments. The test for
group relations is a bit different. The
compatibility measure $c(u_i, n_k, u_j, n_l)$ can be
computed only for $u_i$ in group $G_i$ and $u_j$ in group
$G_j$ where $G_i$ is related (is a neighbor in the graph)
to $G_j$. The problem is then to determine if $n_k$ is
properly related to $G_j$ (e.g., above) and $G_i$ is also
related to $n_l$ (above). $R(u, n)$ as given in Eq. 1 is
computed in the same manner except all possible
second model units ($u_j$) are considered. (Possible
in this case means that the two groups are related
and $u_j$ is assigned.)

The relations between groups are specified by
the model, and the test between $n_k$ and $G_j$ must be
computed each time since specifications of the group
(location, extent, etc.) may change on every
iteration. The relations between simple elements in
the model should correspond to relations between
elements in the segmentation of the image so that
these can be computed once and stored. This
difference results in an increased computation time
for relations between groups compared to relations
between basic elements.
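
The reason a group relation has to be retested on every iteration can be seen in a small Python sketch. Everything here is a hypothetical illustration: the helper names, the use of region centroids as the only location information, and the convention that image rows grow downward (so "above" means a smaller y value) are assumptions, not details from the paper.

```python
def group_centroid(members, best, centroids):
    """Mean location of a group, from the current most likely assignments.

    members:   model elements that belong to the group
    best:      {model element: image region} current top assignment
    centroids: {image region: (x, y)} from the segmentation
    """
    points = [centroids[best[u]] for u in members if u in best]
    if not points:
        return None  # no component is assigned yet, so the group has no location
    xs, ys = zip(*points)
    return (sum(xs) / len(points), sum(ys) / len(points))


def is_above(region_xy, group_xy):
    """Test 'region is above group', assuming y increases toward the image bottom."""
    return region_xy[1] < group_xy[1]
```

Because `best` changes as the relaxation proceeds, `group_centroid` can return a different location after each iteration, so its result cannot be cached the way relations between segmented regions can.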

The group information can be used in other
important ways. The matching procedure operates on
every element in the model description at every
step, but once a group has been completed there is
no need to repeatedly consider the elements in the
group for additional assignments. It is not the
case that once an individual element is assigned it
can be removed from consideration, since succeeding
iterations are needed for correcting any errors
which occur. The group elements are not eliminated
from the entire process (the relations to other
elements are even more important than before); they
are simply given fixed probability vectors (ones for
all current firm assignments). Thus the main
computation steps (Eqs. 1-7) are not needed for these
elements, which reduces the computation time
proportionally. By adding this assigned group
elimination to the matching procedure the time is
reduced by about half and the results remain the
same.

B. Image to Image Matching

Matching of images at a symbolic level can
provide information similar, though not identical,
to pixel level image matching. The result is a set
of pairs of corresponding objects. From these it is
easy to extract global transformations (scale,
position, orientation, intensity shifts, etc.),
relative displacements (for relative object
heights), and local object changes. The matching
system is identical to that used for the model to
image matching, but there are some differences in
how it is used.

Some of the differences are caused by the
differences  in  the  nature  of  the  descriptions  of 
images  and  scene  models.  The  scene  model  contains 
only  important  objects  and  only  those  feature  values 
and  relations  which  are  relevant  or  consistent.  The 
image  segmentation  cannot  be  restricted  in  the  same 
way,  thus  there  are  many  extra  unimportant  regions, 
all  feature  values  are  available  for  all  regions  and 
all  possible  relations  between  two  regions  are 
included  in  the  description. 

The increased number of regions is addressed
first. Rather than trying to find a match for all
regions in an image we can restrict the search to
those regions which meet a given criterion. We can
filter the image to eliminate ill formed regions
(using the shape parameters with very loose
thresholds), or can restrict the match to some other
subset of regions (the brightest half).

The availability of all features is more a
benefit than a liability. We can use absolute
locations as very strong features after initial
matches are located which can provide the necessary
transformations (translation, etc.). By using
absolute position, the matching can be performed
when differences occur in image segmentations and
feature descriptions. Initially the absolute
position cannot be used in the matching since we
allow arbitrary translations, but when several
matches are located we can generate global
transformations which will approximately register
the images. Because of distortions, height
differences, segmentation differences, etc., no
global transformation will work perfectly, but the
object positions can be used as important features.
This is implemented by adding a transformation
generation step prior to the determination of
initial likelihoods. The transformation is
generated using the objects with translations
closest to the mean translation. This selection can
be done in many ways with different degrees of
complexity; we chose a simple method since we do not
require subpixel level accuracy in the location
transformation. The strengths of the position
features are increased from low to medium to high as
more correspondences are located. Additionally the
number of iterations to try before termination must
be reduced when there are few (fewer than 10) regions
remaining to be assigned.
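
One simple way to realize the transformation generation step, restricted to a pure translation, is sketched below with NumPy. The function name, the `keep` fraction, and the choice to average only the retained translations are assumptions for illustration; the paper states only that the objects with translations closest to the mean translation are used.

```python
import numpy as np


def estimate_translation(src, dst, keep=0.5):
    """Estimate a global (dx, dy) from matched object locations.

    src, dst: arrays of shape (num_matches, 2) holding matched positions.
    keep:     fraction of matches retained (hypothetical parameter).
    """
    d = np.asarray(dst, dtype=float) - np.asarray(src, dtype=float)
    mean = d.mean(axis=0)
    # Distance of each object's translation from the mean translation.
    dist = np.linalg.norm(d - mean, axis=1)
    k = max(1, int(round(keep * len(d))))
    closest = np.argsort(dist)[:k]
    # Average only the translations closest to the mean.
    return d[closest].mean(axis=0)
```

With three matches shifted by (3, 2) and one mismatch shifted by (30, 20), the mismatch lies farthest from the mean and is dropped, so the estimate is (3, 2).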

The final change for image to image matching is
to perform the match in both directions
independently. This means that when we match two
images A and B we treat A as the model and find the
correspondences in B, then treat B as the model and
find the correspondences in A. The final result is
all the pairs of regions which are located in both
cases. This eliminates a few correct matches which
are found in one case but also eliminates most (all
in the examples) incorrect matches since these are
predominantly caused by segmentation differences
(combined or missing regions).
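
Keeping only the pairs found in both directions amounts to a set intersection. The sketch below assumes each matching run returns a list of (model region, image region) pairs; the function name and the sample region labels are hypothetical.

```python
def mutual_matches(a_to_b, b_to_a):
    """Pairs (a, b) found both with A as the model and with B as the model.

    a_to_b: list of (a, b) pairs from matching with image A as the model
    b_to_a: list of (b, a) pairs from matching with image B as the model
    """
    forward = set(a_to_b)
    backward = {(a, b) for b, a in b_to_a}  # flip so both sets use (a, b) order
    return sorted(forward & backward)


# Region 3 of A matches region 9 of B in one direction only, so it is dropped.
pairs = mutual_matches([(1, 5), (2, 6), (3, 9)], [(5, 1), (6, 2), (9, 4)])
```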

C.  Local  Matching 

The matching procedure usually locates clear
matches first and then builds on these islands of
confidence. The building steps generally require
few (i.e., one) iterations. This fact can be used
to substantially improve the computation time of the
matching procedure. If we consider only those
elements which are strongly related to already
assigned objects and allow very few (e.g., one)
iterations, it is possible to quickly and cheaply
make many assignments. We limit the number of
iterations on this step to one since we are using
only very local information which is much less
reliable than the full global description. This
means that we are very confident in the assignment
which arises: there are no competing possibilities.
More assignments are made if more iterations are
allowed, but many more errors are introduced. When
there are no unassigned objects strongly related to
ones that are assigned, or no quick assignment is
found, then the procedure reverts to the normal
matching. This can also be applied with the group
elimination operation, reducing the time even more.
When using the two step hierarchical matching
process we only use the regular features and skip
the group step since we are concentrating on strong
local information only. The determination of
strongly related elements can be arbitrarily
complex, but we have chosen to limit it to a simple
test of adjacency or nearby, since many of the
relative position relations are already limited to
these kinds of regions (especially in the model
description).
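
The local matching step can be outlined in Python as below. The sketch assumes that "strongly related" means adjacent or nearby, as the text chooses, and represents that as an adjacency dictionary over model elements; `quick_assign` is a hypothetical callback standing in for a single relaxation iteration that returns a region only when there is one clear candidate.

```python
def strongly_related_unassigned(adjacent, assigned):
    """Unassigned model elements adjacent (or near) to an assigned element."""
    return sorted(u for u, nbrs in adjacent.items()
                  if u not in assigned and any(v in assigned for v in nbrs))


def local_pass(adjacent, assigned, quick_assign):
    """One cheap pass over local candidates.

    Returns the new firm assignments; an empty result means the caller
    should revert to the normal (global) matching procedure.
    """
    new = {}
    for u in strongly_related_unassigned(adjacent, assigned):
        region = quick_assign(u, assigned)  # one iteration only (hypothetical)
        if region is not None:
            new[u] = region
    return new
```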

V.  Results 

We have applied this system to a variety of
images (generally two views of each scene, see
Figs. 2, 3). For different views of the same scene,
we use the same model. The results are presented as
overlays on the original images, showing the border
of regions or center lines of linear features. The
labels are taken from the name given in the model,
either the user derived model or the image which
serves as a model. Table 1 summarizes the results.
Figure 6 shows the results of matching the
model to two images of the San Francisco area
(Fig. 2). The errors in the second view (Fig. 6b)
are caused by the segmentation errors. The two
sections of the Bay Bridge are missed by the linear
features program; this causes these two to be
missed, and the island, which is adjacent to the
bridges and both portions of the bay, to be
mismatched. (Note that the two sections of the bay
were intended to be split by the bridges.) Figure 7
is the same result except that the group features
and relations are used. The results are the same
except that one section of the Bay Bridge in view 2
is not matched (which is the correct result) and a
second match is found for a park area in view 1.
The computation times (with and without group
features) are similar.

Figure 8 gives the results for a subwindow of
the low altitude aerial images (Fig. 3) without the
group information. Figure 9 shows the improvement
when group features and relations are used. In the
first view 2 fewer mistakes are made. In the second
view mistakes are reduced by 7 and correct matches
are increased by 3. Because of the cost of group
relations the computation time increased
substantially. Different objects are segmented
poorly in the two views, but the matching still
works well for both. In the seven errors (see
Table 1), 3 are objects with no correct match, 3 are
multiple matches where the correct match also occurs
and one is an extra match to a small nearby region.
Figures 10-12 illustrate the image to image matching
process. In Fig. 10 the first view is used as the
model, and the second is used in Fig. 11 (the image
used as the model is the one on the left).
Figure 12 shows those pairs which occur in both
cases. Table 2 gives the computed disparities for
each of these 31 matched objects.

VI. Summary and Conclusions

This  paper  presents  an  extension  to  an  earlier 
symbolic  matching  program.  The  extensions  improve 
the  performance  of  the  matching  procedure  for  model 
to  image  matching  when  there  are  groups  or  clusters 
of  objects.  Additional  changes  improve  the 
performance  of  the  image  to  image  matching  task. 
The matching results are very good, but not perfect.
There is no post processing to eliminate matches
which are not consistent with the others; such a
step could reduce the errors.
TABLE 2. TRANSLATIONS COMPUTED FROM MATCHING
REGIONS. THESE ARE GROUPED BY THE

TABLE 3. COMPARISON WITH OTHER RELAXATION TECHNIQUES.
COMPLETE IS THE OPTIMIZATION APPROACH [1, 8],
PROJECT ONLY USES THE PROJECTION FUNCTION
BUT NOT OPTIMIZATION. PRODUCT COMBINES THE
INDIVIDUAL MATCHING VALUES USING PRODUCT
RATHER THAN SUM (AS IN EQ. 4). ORIGINAL IS
FROM [9] AND IS GIVEN FOR HISTORICAL PURPOSES.

Figure 2. High altitude view of San Francisco area.

Figure 3. Low altitude aerial image: (a) view 1 (October), (b) view 2 (August)

Figure 4. Use of relations in compatibility computation.

Figure 5. Use of relations between groups in compatibility computation

Figure 10. Low altitude image to image (view 1 used as model)
matching. Left: view 1, Right: view 2

Figure 11. Image to image (view 2 used as model). Left: view 2,
Right: view 1

Figure 12. Image to image matching, the combined results of Fig. 10
and Fig. 11. Left: view 1, Right: view 2


Abstract

This paper describes an efficient technique for calculating
depth maps from stereo pairs of images. This technique
is based on a linear approximation to the image
at each point, and was described in another form in
(Lucas 81). We show how the algorithm can be augmented
by smoothing the images, iterating the calculation,
and using weighting factors. A theoretical result
concerning the performance of the algorithm is presented,
and experimental results on random dot stereograms
and natural images are described.


1.  Introduction 

In (Lucas 81), we presented an efficient technique for
stereo matching based on the use of derivatives to direct
the search for the best match. As presented, the
technique was able to find matches for isolated points,
for example feature points located by an interest operator
(see (Moravec 80)). For some applications it is
desirable to know the distance at a dense set of points
or at each pixel in the image. For example, image
segmentations based on depth (distance) information as
well as light intensity would not cause spurious regions
due to surface markings to be generated. Another
application is the generation of topographic maps from
aerial stereo images. The importance of the depth map
as an intrinsic image in image interpretation was
recognized in (Barrow 78). In this paper we show how the
technique presented in (Lucas 81) can be extended for
the direct computation of a depth map from a stereo
pair of images.


A depth map can be defined as an "image" in which
the pixel at each point encodes the distance of the
image at that point from the camera. What we shall
actually compute is the closely related disparity map.
The value h(x,y) at position (x,y) in a disparity map is
a statement that the image at position (x,y) in (say)
the left image of a stereo pair corresponds to the image
position (x + h(x,y), y) in the right image. (What
would actually be stored in the depth map would be a
linear function of h(x,y) that maps the relevant range
of disparities onto the available range of pixel values.)
That is, h(x,y) encodes the disparity between the left
and right images at position (x,y). The computation of
the disparity, or equivalently the image correspondence,
is the central problem of stereo image interpretation.
Given the disparity and the relative camera
positions, calculating the distance is a simple matter of
geometry. It should be noted that the definition of
disparity given above assumes that certain of the camera
axes are parallel. If this assumption is violated, the
particulars of what follows change, but the algorithm
itself is not materially changed. For example, it is always
possible to resample the images, producing a new
pair of images which do satisfy the assumption.
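
As an illustration of the two conversions just mentioned, the following Python sketch maps a disparity linearly onto a pixel value and converts a disparity to a distance. The function names, the 256 gray levels, and the relation Z = fB/|h| (parallel optical axes, focal length f in pixels, baseline B) are assumptions made for this example and are not taken from the text above.

```python
def disparity_to_gray(h, h_min, h_max, levels=256):
    """Map a disparity in [h_min, h_max] linearly onto 0..levels-1."""
    t = (h - h_min) / (h_max - h_min)
    return round(t * (levels - 1))


def disparity_to_distance(h, focal_length, baseline):
    """Distance for parallel cameras (assumed model): Z = f * B / |h|.

    focal_length is in pixels, baseline in the unit wanted for Z.
    """
    return focal_length * baseline / abs(h)


if __name__ == "__main__":
    print(disparity_to_gray(0.0, -8.0, 8.0))
    print(disparity_to_distance(4.0, 500.0, 0.2))
```

A zero disparity corresponds to an infinitely distant point under this model, so a caller would have to treat h = 0 separately.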

2. Calculating the depth map efficiently

Let L(x,y) be the pixel value at position (x,y) of the
left image of a stereo pair, and R(x,y) be the pixel
value at position (x,y) of the right image. As implied
by our definition above, assume that
L(x,y) ≈ R(x + h(x,y), y). Our goal will be to compute
an estimate of the depth map, ĥ(x,y), such that
L(x,y) is as nearly equal as possible to
R(x + ĥ(x,y), y). Now, one way to do this would be
just to try all possible values of ĥ(x,y) at each (x,y),
finding the one which minimizes
| L(x,y) - R(x + ĥ(x,y), y) |. This approach has two
problems. First, it is inefficient. But more importantly,
it is now possible for ĥ(x,y) to vary wildly as x and
y vary. But we know that h(x,y) will vary smoothly at
most points, due to the coherence of matter (see, for
example, (Marr 79)). Thus we need to constrain ĥ(x,y)
to vary relatively smoothly. We will do this implicitly
by assuming in our solution that h(x,y) is nearly constant
in a small neighborhood about each (x,y). In
particular, we will choose each ĥ(x,y) to minimize the
local error

\[ E(x,y) = \sum_{(u,v)\ \mathrm{near}\ (x,y)} \left[ R(u + \hat{h}(x,y),\, v) - L(u,v) \right]^2. \qquad (1) \]

(All subsequent sums will be over the same range.)
This relies on assuming that h(x,y) will be nearly uniform
over the region of summation around (x,y). Or,
looked at in another way, our computation of ĥ(x,y)
will incorporate information from a region around
(x,y). Neighboring values of ĥ(x,y) will incorporate
information from nearly identical neighborhoods of the
image, and thus can be expected to vary but a small
amount. This will be seen in more detail below.

We now turn to the problem of calculating ĥ(x,y) efficiently.
Here is where the technique described in
(Lucas 81) comes in. To minimize E(x,y) in equation
(1), we first approximate R(u + ĥ(x,y), v) as
R(u,v) + ĥ(x,y) R_x(u,v), where R_x is the derivative of
R with respect to x, obtaining

\[ E(x,y) \approx \sum \left[ R(u,v) + \hat{h}(x,y)\, R_x(u,v) - L(u,v) \right]^2. \qquad (2) \]

But this is just a sum of terms quadratic in ĥ(x,y).
Thus, we can differentiate E(x,y) with respect to
ĥ(x,y) and set the result equal to zero, obtaining

\[ 0 = \sum 2 \left[ R(u,v) + \hat{h}(x,y)\, R_x(u,v) - L(u,v) \right] R_x(u,v). \qquad (3) \]

Solving,

\[ \hat{h}(x,y) = \frac{\sum \left[ L(u,v) - R(u,v) \right] R_x(u,v)}{\sum R_x(u,v)^2}. \qquad (4) \]

Now we see in particular how ĥ(x,y) is constrained to
be smooth. Each of the sums in the numerator and in
the denominator of equation (4) is roughly a smoothed
version of an "image" similar to the original images
R(x,y) and L(x,y). The ratio should be similarly
smooth except where the denominator is near zero. An
inspection of the denominator shows that this occurs
only where the image is relatively constant; in such
cases the derivative of the image gives no information
and so the method is not expected to work anyway.
Use of this algorithm in a vision system should take
this point into consideration.
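
The single-pass estimate of equation (4) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used for the experiments: the central-difference approximation of R_x, the window half-width `half`, and the threshold `eps` below which the denominator is treated as zero are all assumptions of the sketch.

```python
def estimate_disparity(left, right, half=2, eps=1e-9):
    """Evaluate equation (4) at every pixel of two equally sized images.

    left, right: lists of rows of numbers. Returns a list of rows in
    which an entry is None wherever the denominator is (near) zero,
    i.e. in featureless regions.
    """
    height, width = len(left), len(left[0])
    # R_x by central differences (assumed); zero on the left/right border.
    rx = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(1, width - 1):
            rx[y][x] = (right[y][x + 1] - right[y][x - 1]) / 2.0

    h = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            num = den = 0.0
            # Sum over a rectangular neighborhood clipped to the image.
            for v in range(max(0, y - half), min(height, y + half + 1)):
                for u in range(max(0, x - half), min(width, x + half + 1)):
                    num += (left[v][u] - right[v][u]) * rx[v][u]
                    den += rx[v][u] ** 2
            if den > eps:
                h[y][x] = num / den
    return h
```

The direct double loop over the window is used here only for clarity; it does not exploit the incremental summation that makes the method efficient.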

From the formula above, we can see that ĥ(x,y) can be
calculated efficiently if we take "(u,v) near (x,y)" to
mean "(u,v) in a rectangular neighborhood around
(x,y)". The efficiency stems from a well-known incremental
technique for smoothing an image, i.e. calculating
Σ f(u,v) where (u,v) ranges over a rectangular region
around a point (x,y), using only two additions and
two subtractions per pixel. (See, for example, (Price
76)). Thus, ĥ(x,y) above can be calculated using only
two additions, four subtractions (including one to compute
R_x(x,y)), two multiplications, and one division per
pixel.
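
One common form of the incremental summation technique is sketched below in Python. It keeps a running sum for each column and a running sum for the window, so that each output costs one addition and one subtraction to update a column sum plus one addition and one subtraction to slide the window. The function name and the choice to return only windows lying wholly inside the image are assumptions of this sketch; the formulation in (Price 76) may differ in detail.

```python
def box_sum(img, half):
    """Sum of img over every (2*half+1)-square window inside the image."""
    height, width = len(img), len(img[0])
    size = 2 * half + 1
    out = [[0] * (width - size + 1) for _ in range(height - size + 1)]
    # Column sums for the first band of `size` rows.
    col = [sum(img[y][x] for y in range(size)) for x in range(width)]
    for top in range(height - size + 1):
        if top > 0:
            # Slide the band down: add the new row, subtract the old one.
            for x in range(width):
                col[x] += img[top + size - 1][x] - img[top - 1][x]
        s = sum(col[:size])
        out[top][0] = s
        for left in range(1, width - size + 1):
            # Slide the window right: add the new column, subtract the old.
            s += col[left + size - 1] - col[left - 1]
            out[top][left] = s
    return out
```

Applying `box_sum` once to the products [L - R] R_x and once to R_x squared gives the numerator and denominator of equation (4) at every pixel.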

3. Improving the calculation

This  technique  does  not  perform  entirely  satisfactorily 
as  given.  In  particular,  it  cannot  tolerate  a  very  large 
disparity.  However,  as  discussed  in  (Lucas  81),  three 
things  can  be  done  to  make  it  a  viable  algorithm. 

First,  the  accuracy  of  the  linear  approximation  made  in 
equation (2) can be improved by smoothing L(x,y) and
R(x,y). This can also be thought of as removing high
spatial  frequency  components  from  the  images.  The 
value  of  this  is  shown  by  the  following  observation:  If 
we  are  attempting  to  find  the  disparity  between,  say, 
two  identical  sine  waves,  then  a  disparity  of  one-half 
the  wavelength  of  the  sine  waves  results  in  an  inherent¬ 
ly  ambiguous  match,  and  for  any  larger  disparity,  the 
closest  match  is  incorrect.  Thus,  low  spatial  frequen¬ 
cies  can  be  matched  over  a  wider  disparity  range,  so 
removing  high  spatial  frequencies  should  increase  the 
range  of  unambiguous  match.  Consider,  for  example, 
the  problem  of  matching  two  buildings  textured  with 
windows.  If  the  disparity  is  greater  than  the  size  of  the 
windows,  then  a  mismatch  in  which  two  different  win¬ 
dows  are  paired  is  likely.  However,  if  we  first  smooth 
the images, blurring out the windows and leaving only
the buildings, then we should expect to be able to
match the buildings, given disparities up to about half
the distance between buildings, at which point we may
begin to incorrectly pair the buildings. If such large
disparities need to be tolerated, we can blur the images
even more and match groups of buildings, and so on.
This reasoning is made more precise in the next section.

The second improvement is based on two observations:
first, ĥ(x,y) as computed above is only an approximation
to h(x,y), and second, the blurring of the images
has blurred out detail which could provide a more accurate
match. For example, if we blur together two details
whose disparity (depth) is different, the best we
could hope to calculate is a disparity somewhere midway
between the two. Taken together, these observations
suggest that we should iterate the disparity calculation,
using the previously calculated ĥ(x,y) as an
initial guess at each stage, and that we should use less
blurred versions of L(x,y) and R(x,y) at each iteration.
The calculation now becomes

\[ \hat{h}_{k+1}(x,y) = \hat{h}_k(x,y) + \frac{\sum \left[ L(u,v) - R(u + \hat{h}_k(x,y),\, v) \right] R_x(u + \hat{h}_k(x,y),\, v)}{\sum R_x(u + \hat{h}_k(x,y),\, v)^2}, \qquad (5) \]

where L(x,y), R(x,y) and R_x(x,y) are actually blurred
versions of the left and right images.
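
The iteration of equation (5) over successively less blurred images can be sketched for one-dimensional signals as follows. This is an illustrative sketch only: the box blur, the linear interpolation used to evaluate R at fractional positions, the central-difference derivative, the single global summation window, and the parameter names `blur_widths` and `margin` are assumptions, not details given above.

```python
import math


def box_blur(signal, width):
    """Moving average of odd `width`, replicating the end samples."""
    half = width // 2
    n = len(signal)
    return [sum(signal[min(max(i + d, 0), n - 1)]
                for d in range(-half, half + 1)) / width
            for i in range(n)]


def sample(signal, pos):
    """Linear interpolation of `signal` at fractional position `pos`."""
    i = int(math.floor(pos))
    t = pos - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def refine(left, right, h, margin):
    """One application of equation (5) with a single global window."""
    num = den = 0.0
    for u in range(margin, len(left) - margin):
        r = sample(right, u + h)
        rx = (sample(right, u + h + 1) - sample(right, u + h - 1)) / 2.0
        num += (left[u] - r) * rx
        den += rx * rx
    return h + num / den


def coarse_to_fine(left, right, blur_widths=(9, 5, 1), h0=0.0, margin=16):
    """Iterate equation (5), using a less blurred pair at each stage."""
    h = h0
    for width in blur_widths:
        h = refine(box_blur(left, width), box_blur(right, width), h, margin)
    return h
```

The margin must exceed the largest disparity expected plus the blur half-width, so that every interpolated sample stays inside the signal.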


The third improvement is also discussed in (Lucas 81).
It is based on the observation that the linear approximation
used in equation (2) is more or less accurate at
various places in the pictures, and that the accuracy
can to some extent be detected. The accuracy can be
accounted for by weighting the contribution of each
point (u,v) in the sums in equations (4) and (5) by an
estimate of the accuracy of the linear approximation at
that point. The effect of this weighting is not to extend
the range of acceptable disparities, but rather to
improve the accuracy of the calculated estimate of the
disparity. For more details see (Lucas 81).

4. An analytical result

An analytical result concerning the effect of filtering
the images is possible. Consider a one-dimensional
version of the problem: given R(x) and
L(x) = R(x + h), the one-dimensional form of the
assumption made in section 2, calculate h. The estimate
ĥ analogous to the one in equation (4) is

\[ \hat{h} = \frac{\sum_u \left[ L(u) - R(u) \right] R_x(u)}{\sum_u R_x(u)^2}. \qquad (6) \]

Now, by the Fourier theorem, L(x) can be represented
by

\[ L(x) = \sum_k r_k \cos(2\pi k x + \theta_k), \qquad (7) \]

where the r_k constitute the (square root of the) power
spectrum of L(x) and the θ_k constitute the phase spectrum
of L(x). It can be shown that the estimate for h
given in equation (6) is equivalent to

\[ \hat{h} = \frac{\sum_k k\, r_k^2 \sin 2\pi k h}{2\pi \sum_k k^2 r_k^2}. \qquad (8) \]

Thus, the estimate ĥ depends only on the actual h and
the power spectrum of L(x). This is a highly interesting
result, because the power spectra of typical images
tend to be similar, while all the information is contained
in the phase spectrum (see (Oppenheim 81)).
Thus it should be possible to make general statements
about the behavior of the algorithm on classes of images.
Examination of this formula bears out the suspicion
that suppressing high spatial frequencies improves
the estimate of the disparity.
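
The effect of high spatial frequencies on the estimate can be checked numerically. The Python sketch below evaluates equation (8), read as ĥ = Σ k r_k² sin 2πkh / (2π Σ k² r_k²), which holds when the sums in equation (6) extend over a whole number of periods. The amplitude spectrum r_k = 1/k used in the usage note is a hypothetical choice for illustration, not one taken from the paper.

```python
import math


def estimate_from_spectrum(h, amplitudes):
    """Evaluate equation (8); amplitudes[k-1] is r_k for harmonic k."""
    num = sum(k * r * r * math.sin(2.0 * math.pi * k * h)
              for k, r in enumerate(amplitudes, start=1))
    den = 2.0 * math.pi * sum(k * k * r * r
                              for k, r in enumerate(amplitudes, start=1))
    return num / den


if __name__ == "__main__":
    spectrum = [1.0 / k for k in range(1, 9)]   # hypothetical r_k = 1/k
    h = 0.1                                      # in units of the period
    print(estimate_from_spectrum(h, spectrum))      # all eight harmonics
    print(estimate_from_spectrum(h, spectrum[:1]))  # fundamental only
```

With this spectrum and h equal to one tenth of the period, keeping only the fundamental gives an estimate much closer to the true h than keeping all eight harmonics, which is the behavior the smoothing step relies on.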


5.  Experimental  results 

Depth  maps  were  obtained  from  successive  iterations 
of  the  algorithm  applied  to  a  random  dot  stereogram 
with  a  raised  square  in  the  middle.  The  stereogram  is 
shown in figure 1. The resulting depth maps are shown
in  the  left  halves  of  figures  2  through  5.  The  depth  map 
is  shown  as  an  "image"  with  the  gray  value  encoding 
the  disparity.  Each  successive  figure  shows  the  results 
of  applying  the  algorithm  to  successively  less  blurred 
versions of the images, using previous depth maps for
initial  estimates.  On  the  right  side  of  each  image  is 
shown  a  "reliability  map"  which  indicates  the  reliability 
of  the  depth  map  at  that  point;  white  indicates  high 
reliability,  black  low  reliability.  The  reliability  at  each 
point is based on the sum of the weighting factors mentioned
in section 3. Note the low reliability computed
at the boundary of the square, where the region of
summation  at  each  point  includes  regions  of  different 
disparities.  Note  also  the  thick  bar  of  unreliability  at 
the  left  side  of  the  square  where  the  background  be¬ 
hind  the  square  is  not  visible  to  both  eyes  and  there¬ 
fore  the  disparity  is  undefined. 

The  success  of  the  algorithm  has  also  been  demonstrat¬ 
ed  on  random-dot  stereograms  with  smoothly-changing 
disparities.  Some  success  has  also  been  attained  on 
natural  images,  although  such  results  are  difficult  to 
assess  because  of  the  lack  of  ground-truth  data. 
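
A test pair of the kind described above can be generated with a few lines of Python. This is a sketch under stated assumptions: the image size, square size, disparity, binary dots, and seed are hypothetical, and the construction follows the convention L(x,y) = R(x + h, y) of section 2 with zero disparity for the background.

```python
import random


def random_dot_stereogram(size=64, square=24, disparity=4, seed=0):
    """Return (left, right) random-dot images with a displaced central square."""
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    right = [row[:] for row in left]           # background: zero disparity
    lo = (size - square) // 2
    hi = lo + square
    for y in range(lo, hi):
        for x in range(lo, hi):
            # Inside the square, L(x, y) = R(x + disparity, y).
            right[y][x + disparity] = left[y][x]
        for x in range(lo, lo + disparity):
            # Strip uncovered by the shifted square: fresh random dots.
            right[y][x] = rng.randint(0, 1)
    return left, right
```

The strip on the other side of the square, where the shifted square hides background that is visible in the left image, has no counterpart in the right image, so no meaningful disparity exists there.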

6.  Conclusions  and  Future  work 

A  technique  has  been  demonstrated  for  efficiently  cal¬ 
culating  the  depth  map  based  on  a  linear  approximation 
to  the  image  at  each  point.  The  algorithm  has  been 
further  improved  by  the  addition  of  smoothing,  itera¬ 
tion,  and  weighting.  A  theoretical  result  has  been  pres¬ 
ented  to  show  why  smoothing  helps.  Experimental 
results  have  been  presented  which  demonstrate  that  the 
algorithm  works  for  random  dot  stereograms.  Prelimi¬ 
nary  evidence  shows  promise  for  natural  scenes. 

The  algorithm  can  be  improved  in  several  ways.  As 
stated,  it  does  not  do  well  in  the  vicinity  of  depth  dis¬ 
continuities.  Moreover  an  initial  estimate  must  be 
provided  in  some  way.  This  suggests  combining  the 
algorithm  with  an  algorithm  based  on  feature  points  to 
provide  the  initial  estimate.  As  stated,  the  algorithm 
does  not  deal  well  with  brightness  variations  between 
equivalent  portions  of  the  left  and  right  images  (due  to 
film and camera differences, and to specular reflections).
However, a technique was given in (Lucas 81)
for  dealing  with  this  problem.  As  stated  above,  the 
algorithm  may  fail  in  featureless  regions,  although  these 
can  be  detected  (by  a  small  denominator  in  equation 
(4)).  What  to  do  at  such  points  is  not  clear.  Finally, 
the  theoretical  results  presented  in  section  4  should  be 
extended to the weighted case.


Abstract

In a recent paper [Lowry 81], we described
an  architecture  for  a  computer  vision  rectangular 
processor  array  that  is  suitable  for  VLSI  implemen¬ 
tation.  In  this  paper  we  will  review  that  architec¬ 
ture  and  discuss  extensions  to  it  and  present  results 
of  an  array  simulator  applied  to  vision  algorithms. 
We  will  also  present  an  algorithm  for  re-routing  an 
array  with  bad  processors  into  a  working  subset  of 
the  array,  making  it  feasible  to  implement  a  large 
array  on  one  wafer-sized  chip. 


Overview 

Several  groups  have  implemented  rectangular 
arrays  of  parallel  processors  for  image  processing 
applications [Duff 81]. Most of these processors
have  been  designed  for  implementation  using  only 
a  few  processing  elements  per  chip.  For  reasons  of 
economy,  size,  low  power  consumption,  reliability, 
speed,  and  simplicity  of  design,  our  approach  is 
to  implement  an  entire  array  of  relatively  simple 
processors  on  one  wafer-size  chip. 

Using  a  single  wafer  implementation  results 
in several benefits. By keeping signal lines on-chip,
fan-out is decreased, thus enhancing speed
and density while lowering power consumption. A
wafer  implementation  also  makes  the  best  utiliza¬ 
tion  of  smaller  feature  sizes,  since  higher  density 
makes  possible  implementation  of  a  larger  array 
and  speeds  up  switching  times  proportionately.  At 
present the number of processors that can be put
on a sub-array is limited by the number of pins
on an IC package, unless communication between
processing  elements  on  different  chips  is  time  multi¬ 
plexed.  Thus  there  is  no  intermediate  step  between 


small  sub-arrays  of  processing  elements  and  a  single 
wafer-scale  implementation  of  an  entire  array.  A 
single  wafer  size  chip  also  provides  orders  of  mag¬ 
nitude  savings  in  power  and  space,  which  can  be 
particularly  important  for  autonomous  and  mobile 
applications. 


Architecture 

The  architecture  we  are  describing  is  a 
synchronous  bit-serial  SIMD  rectangular  array  with 
4-neighbor  connectivity.  The  array  size  is  matched 
to  the  image  being  processed,  simplifying  control  of 
the  array  and  reducing  the  amount  of  memory  that 
each  processing  element  needs.  Control  is  in  the 
form  of  microcode  (register-transfer  level)  signals 
shared  among  all  processing  elements. 

Figure  1  shows  the  design  of  one  processing  ele¬ 
ment.  The  processing  element  is  bit-serial,  so  the 
data  paths  in  the  figure  are  one  bit  wide.  Memory 
is  implemented  as  a  set  of  serial  shift  registers, 
shifting  from  MSB  to  LSB  with  the  MSB  coming 
from  a  data  selector  and  the  LSB  going  to  the  arith¬ 
metic  unit.  There  are  two  8-bit  D  registers  for  hold¬ 
ing  operands  and  a  16-bit  R  register  for  storing  a 
result.  The  arithmetic  unit  is  a  full  adder  with 
the  carry  feeding  back  into  the  adder  through  a 
register.  Operands  for  the  adder  come  from  two  in¬ 
put  multiplexors  that  can  select  from  the  constants 
“one”  or  “zero”,  the  D  registers  or  the  R  register, 
or the sum of the arithmetic unit of a neighboring
processing element (through a register). Results
from the adder go to an output selector where they
can be routed to any of the registers. The output
selector loads the MSBs of all registers; one register
can be loaded from the adder output while the other
registers are loaded from their own LSBs. This allows
registers to be used as operands without losing
their original contents. For doing conditional
operations, there is an S register that can be loaded
from the adder output. If the S register is a “zero”,
shifting is disabled in the data registers regardless
of the state of the control, effectively disabling
the processing element. The processing element is
controlled by signals common to all processing elements.
Control signals can route adder inputs and
outputs, load the S register, and independently shift
data registers.
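
The data flow through one processing element can be illustrated with a short Python simulation of a bit-serial addition. This is a simplified sketch, not a model of the fabricated chip: the function name, the way the registers are represented as integers, and the use of the constant “zero” input once the 8-bit operands are exhausted are assumptions of the example.

```python
def bit_serial_add(a, b, s_register=1, width=8, result_width=16):
    """Add two `width`-bit operands one bit per clock, LSB first.

    Returns (R, D0, D1): the result register and the two operand
    registers, which recirculate and so keep their contents.
    """
    d0, d1 = a, b          # the two D (operand) registers
    r = 0                  # the R (result) register
    carry = 0              # carry register feeding back into the adder
    for clock in range(result_width):
        if not s_register:
            break          # S = 0: shifting disabled, element inactive
        if clock < width:
            x, y = d0 & 1, d1 & 1
            # Each operand's LSB is reloaded into its own MSB.
            d0 = (d0 >> 1) | (x << (width - 1))
            d1 = (d1 >> 1) | (y << (width - 1))
        else:
            x = y = 0      # input multiplexors select the constant zero
        s = x ^ y ^ carry                       # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))     # full-adder carry out
        # The sum bit enters R at its MSB while R shifts toward the LSB.
        r = (r >> 1) | (s << (result_width - 1))
    return r, d0, d1
```

After sixteen clocks the first sum bit has reached the LSB of R, and the operand registers have completed a full rotation.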

This  processing  element  was  implemented  in 
nMOS  and  fabricated  (see  figure  2)  at  the  Stanford 
integrated  circuit  facilities  in  late  1980.  The  chip 
was  tested  and  found  to  be  operational  using  an  F- 
15  module  test  stand  at  Hughes  Aircraft  Company. 
An analysis based on Mead and Conway’s design
rules [Mead 80] indicates that the chip should
execute  internal  operations  at  a  10  MHz  clock 
rate.  However,  limitations  inherent  in  the  tester 
precluded  testing  the  maximum  clock  rate. 


Algorithm  Feasibility 

An  examination  of  algorithm  feasibility  was 
done using the Grinnell image processor to expose
problems  that  would  arise  in  detailed  implementa¬ 
tion  of  various  vision  algorithms  in  a  processor  ar¬ 
ray.  Although  these  experiments  are  preliminary, 
the  resulting  algorithm  implementations  run  much 
faster  and  have  generated  new  interest  in  special 
implementations  of  vision  algorithms. 


Figure  1.  Processing  element  design. 


Figure 2. Processing element implementation.

The Grinnell image processor is an option for
the Grinnell GMR image display system. Our
Grinnell at Stanford has four channels of 8 bit
memory configured as 112 X 512 displays, and four
1-bit 512 X 512 “overlay” channels. The image
processor is a 16-bit ALU that takes operands from
four 8-in 8-out lookup tables. Lookup table inputs
can be selected from any of the image memories,
and their outputs can be complemented or set to
“all zeroes” or “all ones”. The ALU can perform
a number of arithmetic and logical operations, and
its 16 bit output can be written into different image
memories (with some minor restrictions). In addition,
overlay channels can be written from several
1-bit sources (including the MSB and the carry out
of the ALU). Two sets of control registers provide
the ability to use an overlay channel to select between
different operations on a pixel-by-pixel basis.
Since memory operations are synchronized to the
display, the image processor finishes an operation
in one frame time. In general, this image processor
is fairly fast and flexible for integer operations
on the image memories, and more detail about its
operation can be found in [Grinnell 80].

Another  option  on  the  Grinnell,  the  zoom/pan 
card,  allows  an  image  to  be  shifted  spatially  before 
being  used  by  the  image  processor.  Using  this  op¬ 
tion  allows  programming  local  operations  similar  to 
those  that  are  well-suited  for  the  processor  array 
discussed in the last section.

We  have  programmed  the  Grinnell  to  imple¬ 
ment  the  edge-magnitude  and  edge-thinning 
steps  of  the  Nevatia-Babu  line-finding  algorithm 
described in [Nevatia 78]. The edge-magnitude
step convolves the original image with six step-edge
masks at 30 degree incremental rotations, then finds
the direction and value of the convolution with the
largest absolute value at each pixel. The edge-thinning
step then selects pixels that are flanked
(in the direction orthogonal to the edge direction)
by pixels which have a smaller edge magnitude and
an edge direction within 30 degrees of the central
pixel.  These  pixels  are  labeled  edges,  and  they  sup¬ 
press  the  labeling  of  the  two  pixels  in  the  direction 
orthogonal  to  their  edge  direction  (this  results  in 
single-pixel  width  lines).  Figure  3  shows  an  image 
and  the  resulting  thinned  edges  (these  images  were 
taken  directly  from  the  Grinnell  system). 
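
The selection made at each pixel by the edge-magnitude step can be sketched as follows. The sketch assumes the six mask responses for one pixel have already been computed; the function name and the convention of reporting the mask orientation in degrees together with the signed response are assumptions of this example.

```python
def strongest_response(responses, step_degrees=30):
    """Return (mask orientation in degrees, signed value) of the
    response with the largest absolute value among the mask outputs."""
    best = max(range(len(responses)), key=lambda i: abs(responses[i]))
    return best * step_degrees, responses[best]


if __name__ == "__main__":
    # Hypothetical outputs of the six masks at 0, 30, ..., 150 degrees.
    print(strongest_response([1, -5, 3, 0, 2, 4]))
```

On an array processor the same comparison would be carried out in every processing element at once, one mask at a time, keeping the running maximum and its orientation.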

One of the things that this experiment shows
is the value of specialized systems for computer
vision algorithms like this one. By taking advantage
of a large memory and a regular addressing
scheme, the Grinnell completes these two steps of
the Nevatia-Babu algorithm in under 15 seconds.
Our experience shows that the same algorithm
(the original version from University of Southern
California) takes about 30 minutes to run on the
same size image using a lightly-loaded timeshared
KL-10. This speedup of two decimal orders of magnitude
has prompted us to begin implementation of
other algorithms in use at Stanford on the Grinnell
system, and is a good indication of the utility of
further research on our processor array.

Figure 3. Image and result of Nevatia-Babu algorithm.


Simulator 

To  better  understand  the  limitations  of  the 
processor array in implementing algorithms and to
get  some  specific  data  on  the  running  times  of  these 
algorithms,  we  have  implemented  a  simulator  of  the 
processor  array.  The  simulator  has  proved  useful  in 
suggesting  design  modifications  and  extensions  to 
the original processing element to make it easier to
program  and  faster  in  executing  vision  algorithms. 

Since  the  original  processing  element  was  de¬ 
signed  to  support  an  unsigned  convolution,  this  al¬ 
gorithm  is  a  natural  first  step  in  simulation.  Figure 
4  shows  a  segment  of  a  line  and  the  result  of  con¬ 
volving  it  with  a  3  X  3  mask  with  all  mask  entries 
set  to  1.  (This  process  is  a  rather  crude  method  for 
filtering lines to display them on a multi-bit display
device).  The  convolution  was  simulated  with  the 
original  processing  element  configuration  described 
in  [Lowry  81]  and  executed  in  565  clock  cycles.  The 
estimated 10 MHz clock rate would result in an execution
time of 56.5 μsec.

Although  it  is  possible  to  use  the  original 
processing  element  configuration  to  do  a  signed 
convolution,  the  addition  of  a  third  D  register 
greatly  simplifies  the  task  by  providing  storage  for 
a  precomplemented  multiplicand  (alternatively,  ex¬ 
tra  processing  steps  could  be  used  to  complement 
the  multiplicand  when  needed).  With  this  exten¬ 
sion  to  the  processing  element,  the  simulator  was 
programmed  to  do  a  signed  convolution  using  two’s- 
complement  arithmetic.  Figure  5  shows  an  image 
and the laterally inhibited image obtained by convolving
it with a 5 X 5 mask with all mask entries
set to -1 except for the center which is set to 24.
The simulated convolution takes approximately 13,600
clocks, yielding an execution time of 1.36 msec.
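
The two convolutions can be reproduced on a conventional machine with the Python sketch below, which also converts a simulated clock count into an execution time. The treatment of pixels outside the image as zero and all function and constant names are assumptions of this sketch; the simulator itself works bit-serially in every processing element in parallel.

```python
BOX_3X3 = [[1] * 3 for _ in range(3)]            # all entries set to 1
LATERAL_5X5 = [[-1] * 5 for _ in range(5)]       # all entries -1 ...
LATERAL_5X5[2][2] = 24                           # ... except the center


def convolve(image, mask):
    """Apply a square, odd-sized mask; pixels outside the image count as 0."""
    height, width = len(image), len(image[0])
    k = len(mask) // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += mask[dy + k][dx + k] * image[yy][xx]
            out[y][x] = acc
    return out


def execution_time(clocks, clock_rate_hz=10e6):
    """Execution time in seconds at the estimated 10 MHz clock rate."""
    return clocks / clock_rate_hz
```

Because the entries of the 5 X 5 mask sum to zero, a region of constant brightness produces a zero response, which is what makes the mask act as lateral inhibition.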


Figure 4. Image and result of unsigned convolution.


Fault-tolerance


Integrated  circuit  yield  decreases  exponentially 
with area. In order to make a wafer-scale integrated
circuit fault-tolerant, it is necessary to incorporate
redundant elements and reconfigurable
interconnections. For a static array where the inter-processor
communication overhead needs to be
kept low, non-volatile hardware is the appropriate
choice for reconfiguration. We are currently considering
laser-fusible links, the re-routing technology
being developed by the restructurable VLSI
project at Lincoln Laboratories. [Blankenship 82]
describes this project including a routing algorithm
for embedding a set of one-dimensional 2-neighbor
connected arrays into a 2-dimensional physical array
with defects. In this report we describe a routing
algorithm that embeds a logical 2-dimensional
array with 4-neighbor connectivity into a physical
2-dimensional array with 8-neighbor connectivity.

The  effectiveness  of  routing  algorithms  to  suc¬ 
cessfully  reconfigure  a  chip  with  defective  elements 
depends  both  on  the  defect  rate  and  the  flexibility 
of  the  interconnection  scheme.  Although  data 
on  defect  rates  is  not  easily  available  from  most 
manufacturers,  we  can  roughly  estimate  the  defect 
rate  for  an  individual  processing  element  by  com¬ 
paring  its  size  to  the  area  of  other  chips  and  us¬ 
ing  the  exponential  model  of  yield.  Based  upon  a 
conservative  comparison,  we  expect  defect  rates  in 
processing  elements  to  be  between  5%  and  10%.  A 
reconfiguration  scheme  based  on  disconnecting  en¬ 
tire  columns  containing  a  defective  processing  ele¬ 
ment  would  be  unacceptable  for  large  arrays.  For 
example,  only  0.14%  of  the  columns  in  a  128  X 
128  array  would  be  defect-free  with  a  processing 
element  defect  rate  of  5%.  A  routing  algorithm 
that  operates  at  the  level  of  individual  processing 
elements  rather  than  large  groups  of  processing  ele¬ 
ments  is  needed. 
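
The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes independent defects with the same probability in every processing element; the exponential yield model is written as Y = exp(-D A), so scaling the area by some ratio raises the yield to that power. Function names are hypothetical.

```python
def pe_defect_rate(reference_yield, area_ratio):
    """Defect rate of an element whose area is `area_ratio` times that
    of a reference chip with the given yield (exponential model)."""
    return 1.0 - reference_yield ** area_ratio


def defect_free_column_fraction(defect_rate, column_length):
    """Probability that every element of one column is good."""
    return (1.0 - defect_rate) ** column_length


if __name__ == "__main__":
    # 128 elements per column at a 5% defect rate: about 0.14%.
    print(100.0 * defect_free_column_fraction(0.05, 128))
```

The fraction falls off geometrically with column length, which is why discarding whole columns cannot work for large arrays.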


Routing  Algorithm 


The routing scheme we developed maps an
N X N logical array onto an M X N physical array
of processing elements, where M > N. Columns
in the logical array are mapped onto columns in
the physical array, while rows in the logical array
are mapped onto paths that follow diagonal and
horizontal links in the physical array. Defective
processing elements are skipped along columns,
while they are detoured along rows. Figures 6 and
7 show the redundant links for connections along
rows and columns respectively, with marked locations
indicating where a laser would fuse unwanted
links. Since physical columns correspond to logical
columns, only extra rows are needed. Figure
8 shows a 10 X 10 logical array mapped onto a
physical array with a 20% defect rate using 5 extra
rows. The algorithm starts with the bottom row
in the logical array and finds a route for each successive
row in the logical array. For each row, a
route is found from left to right using the diagonal
and horizontal links, keeping as close to the bottom
of the physical array as possible. If a dead
end is reached while routing a row the algorithm
backtracks along the row. Once a row is routed,
it stays fixed. Each processing element makes a
connection to its logical right neighbor by attempting
physical connections to the lower right, right,
or upper right, in that order. A connection fails
if the processing element being connected to is already
assigned, is defective, or leads to a previously
determined dead end. The algorithm is linear in the
number of processing elements since each element is
backtracked over at most once. The running time
on a KL-10 for a 128 X 128 array with a 20% defect
rate is 1.5 seconds.
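
The row-routing procedure described above can be sketched in Python as follows. This is a minimal reconstruction from the description, not the program whose running time is quoted: the data representation, the function names, and the rule of starting each row at the lowest free working element of the first column are assumptions.

```python
def route_rows(good, n_logical):
    """Route logical rows through a physical array.

    good[r][c] is True if the element in physical row r (0 = bottom),
    column c works. Returns paths, where paths[i][c] is the physical
    row carrying logical row i in column c, or None if routing fails.
    """
    m, n = len(good), len(good[0])
    assigned = [[False] * n for _ in range(m)]
    dead = [[False] * n for _ in range(m)]

    def usable(r, c):
        return (0 <= r < m and good[r][c]
                and not assigned[r][c] and not dead[r][c])

    def route_one():
        for start in range(m):            # lowest free element first
            if not usable(start, 0):
                continue
            path = [start]
            while path:
                r, c = path[-1], len(path) - 1
                if c == n - 1:
                    return path
                # Try lower right, right, upper right, in that order.
                for nxt in (r - 1, r, r + 1):
                    if usable(nxt, c + 1):
                        path.append(nxt)
                        break
                else:
                    dead[r][c] = True     # dead end: never tried again
                    path.pop()            # backtrack along the row
        return None

    paths = []
    for _ in range(n_logical):
        path = route_one()
        if path is None:
            return None
        for c, r in enumerate(path):
            assigned[r][c] = True         # a routed row stays fixed
        paths.append(path)
    return paths
```

Because a dead-end mark is permanent and each element can receive it only once, the total work is proportional to the number of processing elements, in line with the linearity argument above.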


Figure 7. Redundant links along columns.

Figure 8. Rerouted 10 X 10 array. Black boxes
are defective processors. Boxes with vertical lines
through them are unused.


Analysis of Routing Algorithm

A  lower  bound  on  the  number  of  extra  rows 
needed  to  route  an  N  X  N  logical  array  is  the  ex¬ 
pected  maximum  number  of  defects  along  columns 
in  the  physical  array.  The  probability  distribution 
for this function is a cumulative binomial distribution
raised to the power of the number of columns.
The exponentiation causes the probability distribution
to asymptotically approach a vertical step edge.
Thus there is little difference between the number of
rows needed to insure that 25% of the wafers can be
successfully  routed  and  the  number  needed  for  90% 
successful  routing.  This  probability  function  was 
used  in  computing  theoretical  lower  bounds  on  the 
number  of  redundant  rows  needed  to  insure  90% 
successful  routing  of  a  wafer  as  a  function  of  the 
processing  element  defect  rate  and  the  size  of  the 
logical  array.  Figure  9  compares  these  lower  bounds 
with  results  from  Monte  Carlo  simulations. 
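
The lower bound can be computed directly from the distribution just described. In the Python sketch below, a wafer with N columns and N + e physical rows is counted as routable in the best case when no column contains more than e defects; the function names and the assumption of independent defects with probability p per element (p < 1) are ours.

```python
from math import comb


def prob_all_columns_ok(n, extra, p):
    """Probability that each of n columns of n + extra elements
    contains at most `extra` defects."""
    m = n + extra
    column_ok = sum(comb(m, d) * p ** d * (1.0 - p) ** (m - d)
                    for d in range(extra + 1))     # cumulative binomial
    return column_ok ** n                          # all n columns


def min_extra_rows(n, p, target=0.9):
    """Smallest number of extra rows meeting the target probability."""
    extra = 0
    while prob_all_columns_ok(n, extra, p) < target:
        extra += 1
    return extra


if __name__ == "__main__":
    for goal in (0.25, 0.90):
        print(goal, min_extra_rows(128, 0.05, goal))
```

Comparing the two printed values for a 128 X 128 array shows how steep the distribution is between the 25% and 90% levels.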

Thousands  of  simulations  were  run  to  verify 
the  viability  of  the  routing  algorithm.  For  defect 
rates  of  less  than  10%  the  empirical  results  are 
in  excellent  agreement  with  the  theoretical  lower 
bound.  For  defect  rates  greater  than  20%  the  em¬ 
pirical  results  become  increasingly  larger  than  the 
theoretical  lower  bound.  It  would  appear  that  the 
number  of  dead  end  routes  becomes  the  dominant 
factor with large defect rates. The results show
that for a fixed defect rate the number of extra
rows needed to insure 90% routability is a decreasing
percentage of the linear dimension of the logical
array.  This  percentage  is  roughly  16%  for  a  5% 
defect  rate,  34%  for  a  10%  defect  rate,  and  88% 
for  a  20%  defect  rate.  Thus  this  routing  scheme 
appears  viable  for  defect  rates  up  to  10%,  and  per¬ 
formance  increases  with  larger  array  dimensions. 


Figure  9.  The  ratio  of  total  rows  needed  /  rows 
routed  to  insure  that  90%  of  the  wafers  can  be 
successfully routed. This ratio is a function of
the defect rate for individual processing elements
and the size of the N X N array that is logically
implemented.  Each  entry  consists  of  the  empirically 
derived  ratio  on  the  left  and  the  theoretical  lower 
bound  on  the  right. 


Evaluation  of  Routing  Scheme 

The  routing  scheme  has  the  advantages  that 
only  extra  rows  are  needed,  the  reconnections  are 
relatively  simple  compared  to  those  required  for  ar¬ 
bitrary  reconnections  between  neighbors,  and  it  is 
compatible  with  laser  technology  being  developed 
for restructuring wafer-scale VLSI chips. The algorithm
is simple, efficient, and linear in the number
of processing elements, even though the general
problem of finding a subgraph isomorphism is NP-complete
[Garey 79].

For  defect  rates  higher  than  10%,  this  scheme 
could  be  extended  to  allow  skipping  along  rows 
and  routing  vertical  links  along  diagonals.  Extra 
rows  and  columns  would  be  needed,  but  the  scheme 
would  no  longer  be  limited  by  the  maximum  num¬ 
ber  of  defects  along  a  column.  A  more  sophisticated 
routing  algorithm  would  be  used. 

Although the algorithm was designed and
tested  assuming  no  defects  in  the  interconnection 
hardware,  simple  modifications  would  enable  it  to 
route  around  bad  interconnections.  Defects  in  the 
horizontal  and  diagonal  interconnections  would  be 
incorporated  in  the  test  to  connect  to  a  right  neigh¬ 
bor.  Defects  in  interconnections  between  vertically 
adjacent  neighbors  would  be  handled  by  treating 
one  of  the  adjacent  processing  elements  as  defec¬ 
tive.  Defects  in  the  interconnections  that  skip 
defective  processing  elements  along  columns  can¬ 
not  be  tolerated;  however,  considerable  redundancy 
can  be  incorporated  into  these  interconnections  by 
fabricating  two  lines  and  connecting  them  at  each 
processing  element  with  a  fusible  link.  Any  break  in 
one  line  will  then  be  compensated  for  by  the  other 
line. 


Summary 

In  this  paper  we  reviewed  an  architecture  for 
a  computer  vision  rectangular  processor  array  that 
is  suitable  for  VLSI  implementation.  The  results 
of  an  array  simulator  applied  to  convolutions  show 
that  the  speed  and  internal  memory  of  the  bit-serial 
processing  elements  is  well  suited  to  vision  algo¬ 
rithms.  Rough  calculations  indicate  a  defect  rate 
between  5%  and  10%  for  individual  processing  ele¬ 
ments.  A  routing  scheme  suitable  for  these  defect 
rates  and  large  arrays  was  implemented  and  sub¬ 
jected  to  extensive  simulation.  The  results  indicate 
the  feasibility  of  implementation  on  a  wafer-scale 
chip using technology being developed for restructurable
VLSI.


ABSTRACT


Several segmentation techniques were applied
to a set of 51 FLIR (Forward-Looking InfraRed)
images of four different types, and the results
were compared to hand segmentations. There were
substantial differences in performance, indicating
that the choice of proper technique is very important.
The segmentation techniques used were "superslice",
"pyramid spot detection", two versions of "relaxation",
"pyramid linking", and "superspike". One technique,
"superspike", outperformed all the others, detecting
88% of the targets and yielding only 1.6 false alarms
per true target.

1.  Introduction 


Object detection in infrared images is a
problem of considerable practical interest [1].
Numerous techniques have been developed for the
primary purpose of segmenting FLIR (Forward-Looking
InfraRed) images into objects and background
(e.g., [1,2]); in particular, [3] is a
survey of such techniques, and [4] describes a
comparative study. This paper summarizes the
results of another comparative study; further
details about the study can be found in [5].

Section  2  describes  the  segmentation  tech¬ 
niques  that  were  tested;  Section  3  describes  the 
evaluation  procedure;  and  Section  4  summarizes  the 
results  of  the  study. 


2. Segmentation techniques

The techniques tested are briefly described in
the following paragraphs; for further details see
the cited references.

2.1 Superslice [6]

This technique was quite successful in
earlier studies of FLIR object detection [1]. A
set of gray level thresholds is applied to the
given image, and for each threshold, connected
components of above-threshold points are extracted.
The gray level gradient is also measured for the
image, and points at which it is a local maximum
are determined. A component is selected as a
possible object if many gradient maxima coincide
with its border and surround it.

2.2 Pyramid spot detection [7]

This technique is designed to extract compact
objects of arbitrary size from an image; it too
performed well in earlier studies. We build an
exponentially tapering "pyramid" of reduced-resolution
versions of the image by successive
block averaging, e.g., using nonoverlapping 2x2
blocks, or 4x4 blocks with 50% overlap in each
direction, so that each image is half the size
(1/4 the area) of the preceding. At each level of
the pyramid, we apply a standard spot-detection
operator - e.g., we compare each pixel to its
eight neighbors, and judge a spot to be present if
they differ sufficiently. A spot that is detected
in this way should correspond to a compact object
on a contrasting background in the original image.
For each such spot, we consider the portion of the
original image corresponding to the pixel and its
neighbors, and apply a threshold to this portion,
chosen midway between the gray level of the pixel
(an average of a block of gray levels in the
original image) and the average gray level of its
neighbors (an average of block averages). This
thresholding generally extracts the object that
gave rise to the spot detection.


The  support  of  the  Defense  Advanced  Research 
Projects  Agency  and  the  U.S.  Army  Night  Vision 
Laboratory  under  Contract  DAAG-53-76C-0138 
(DARPA  Order  3206)  is  gratefully  acknowledged,  as 
is  the  help  of  Clara  Robertson  in  preparing  this 
paper. 
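
The pyramid spot detection technique (Section 2.2) can be sketched roughly as follows. This is a minimal illustration, not the implementation of [7]: it assumes nonoverlapping 2x2 block averaging, takes "differ sufficiently" to mean that a pixel differs from the mean of its eight neighbors by at least a caller-supplied `min_contrast`, and uses hypothetical function names throughout.

```python
import numpy as np


def build_pyramid(image, min_size=2):
    """Reduce resolution repeatedly by averaging nonoverlapping 2x2 blocks."""
    levels = [np.asarray(image, dtype=float)]
    while min(levels[-1].shape) >= 2 * min_size:
        a = levels[-1]
        h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
        a = a[:h, :w]
        levels.append(a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return levels


def detect_spots(level, min_contrast):
    """Return (row, col, neighbor_mean) for interior pixels that differ from
    the mean of their eight neighbors by at least min_contrast."""
    spots = []
    for r in range(1, level.shape[0] - 1):
        for c in range(1, level.shape[1] - 1):
            window = level[r - 1:r + 2, c - 1:c + 2]
            neighbor_mean = (window.sum() - level[r, c]) / 8.0
            if abs(level[r, c] - neighbor_mean) >= min_contrast:
                spots.append((r, c, neighbor_mean))
    return spots


def extract_object(image, level_index, spot, level_value):
    """Threshold the part of the original image under the spot's 3x3
    neighborhood, midway between the spot value and its neighbor mean."""
    r, c, neighbor_mean = spot
    scale = 2 ** level_index
    r0, r1 = (r - 1) * scale, (r + 2) * scale
    c0, c1 = (c - 1) * scale, (c + 2) * scale
    threshold = (level_value + neighbor_mean) / 2.0
    patch = np.asarray(image, dtype=float)[r0:r1, c0:c1]
    # Bright spots keep pixels above the threshold, dark spots those below.
    mask = patch > threshold if level_value > neighbor_mean else patch < threshold
    return (r0, c0), mask
```

A spot found at a coarse level thus selects a window of the full-resolution image, and the midway threshold separates the object from its surround within that window.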


2.3  Relaxation  [8] 

"Relaxation"  methods  of  object  extraction 
have  been  extensively  studied.  The  basic  approach 
is  to  initially  assign  "object"  and  "background" 
probabilities  to  each  pixel,  based  on  their  dis¬ 
tances  from  the  ends  of  the  grayscale.  The  pro¬ 
babilities  are  then  iteratively  adjusted  based  on 
the  probabilities  of  the  neighboring  pixels,  with 
like reinforcing like. When this is done, the
probabilities tend to converge to relative certainty
((0,1) or (1,0)), and yield a good segmentation
of the image into objects and background.

An  alternative,  also  investigated,  used  three 
rather  than  two  classes,  assigning  initial  proba¬ 
bilities  based  on  distances  from  the  ends  and  mid¬ 
point  of  the  grayscale;  thus  the  pixels  were  not 
forced  to  choose  between  "target"  and  "background", 
but  also  had  a  third  option  ("clutter"). 
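
A two-class relaxation of the kind described above might be sketched as follows. The update rule shown is a standard nonlinear probabilistic relaxation step with "like reinforcing like"; the specific compatibility values, the `strength` damping factor, the 8-neighbor support, and all function names are assumptions made for illustration and are not taken from [8].

```python
import numpy as np


def initial_object_probability(image):
    """Map gray levels linearly onto [0, 1]: distance from the dark end."""
    g = np.asarray(image, dtype=float)
    lo, hi = g.min(), g.max()
    if hi == lo:
        return np.full(g.shape, 0.5)
    return (g - lo) / (hi - lo)


def neighbor_mean(p):
    """Mean object probability over the eight neighbors (edges replicated)."""
    padded = np.pad(p, 1, mode="edge")
    total = np.zeros_like(p)
    for dr in (0, 1, 2):
        for dc in (0, 1, 2):
            if (dr, dc) != (1, 1):
                total += padded[dr:dr + p.shape[0], dc:dc + p.shape[1]]
    return total / 8.0


def relax(p, iterations=10, strength=0.9):
    """Iteratively adjust object probabilities from neighbor support.

    q > 0 where the neighbors favor "object", q < 0 where they favor
    "background"; strength < 1 keeps the normalizing sum away from zero.
    """
    for _ in range(iterations):
        q = strength * (2.0 * neighbor_mean(p) - 1.0)
        obj = p * (1.0 + q)
        bg = (1.0 - p) * (1.0 - q)
        p = obj / (obj + bg)
    return p
```

Thresholding the converged probabilities at 0.5 then gives the object/background segmentation; the three-class variant would carry one probability array per class instead of a single array.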

2.4 Pyramid linking [9]

This is a method of segmenting an image based
on creating links between pixels at successive
levels of a "pyramid". We build the pyramid using
overlapping 4x4 blocks; thus each pixel has 16
"sons" (on the level below) that contribute to its
average, and four "fathers" (on the level above)
to whose average it contributes. We now link each
pixel to the father whose value (= average) is
closest to its own. We then recompute the averages,
allowing only those sons that are linked to
a pixel to contribute to its average. We now
change the links based on these new averages, then
recompute the averages again, and so on. This
process stabilizes after a few iterations; at this
stage the links define subtrees of the pyramid,
rooted at the top level, which we take to be 2x2,
so that there are (at most) four trees. The sets
of leaves of these trees (pixels in the original
image) thus define a segmentation of the original
image into at most four subsets.

2.5 Superspike [10]

This is a method of image smoothing based on
iterated selective local averaging. Each pixel is
averaged with those of its neighbors that satisfy
the following criteria, based on the image's
histogram:

a) The neighbor is more probable than the
pixel, i.e., its gray level has a higher
value in the histogram.

b) The histogram has no concavity between the
gray levels of the pixel and the neighbor
(as would be the case if they belonged to
two different peaks, or to a peak and
shoulder).

When this process is iterated a few times, the
histogram generally turns into a small set of
spikes. The image can then be segmented by mapping
the smaller spikes into nearby taller ones until
only five spikes remain, thus segmenting the image
into five subsets. The choice of five classes was an
arbitrary one, based on preliminary experiments
in which it was found that using fewer classes
tended to merge some objects into the background.

3.  Methodology 

The  overall  approach  used  in  the  comparative 
study  was  as  follows: 

1) Each technique being tested (Section 2) was
applied to the given set of images, yielding
a classification of each image into
subsets. Connected component labelling
was performed on the resulting classified
images, yielding a set of regions.


2) Regions that were too large, too small, or
too elongated to be targets were eliminated.
In our main study, the criteria for acceptability
were

4 < height < 41 (pixels)
3 ≤ width ≤ 50 (pixels)
0.4 ≤ aspect ratio ≤ 2.0

In addition, regions having the wrong
polarity relative to the mean image gray
level were eliminated.

3) For each surviving region, the coordinates
of its centroid and the dimensions of its
upright circumscribing rectangle were
computed. The centroids and circumrectangles
of the true targets were also
known (from ground truth information and
hand segmentation). A target was said to
have been detected if the x and y displacements
between a region centroid and a
true target centroid were at most half the
true target's rectangle dimensions.

Region centroids not satisfying these conditions
were considered to be false alarms.
The "segmentation accuracy" for each detected
target was measured by the fraction
of overlap between the circumrectangle of
the detected region and that of the true
target. "Extra detections" were said to
occur when more than one region centroid
occurred in the inner half of a true
target's rectangle; all such detections
were counted in computing the average
segmentation accuracy. These methods of
evaluating a segmentation were proposed in
[3].


4. Experiments

In a pilot study, all six techniques (including
both two-class and three-class relaxation) were
applied to three image samples (see Figure 1).
Figure 1 also shows the resulting segmented images.
We see that the pyramid spot technique did not
perform very well. This is not too surprising,
since this technique was designed for the extraction
of isolated objects on a contrasting background.
Results with the relaxation, pyramid
linking, and superspike techniques looked more
promising, and it was therefore decided to use all
of them in the main study. The superslice technique
was not used in the main study because of its
comparatively high computational cost, which made
its use relatively impractical.

The main study used a set of 51 FLIR images
supplied by Westinghouse Systems Development
Division [3] from Navy (Nos. 2-10), Army (Nos.
11-30, 55-70), and Air Force (Nos. 31-36) sources
(Figure 2).* All images are 128x128; Nos. 11-30
were obtained from 64x64 images by horizontal
and vertical reflection, in order to present the
targets in four orientations.


* Further information about the data base can be
obtained from Mr. Bruce J. Schacter, Westinghouse
Systems Development Division, Baltimore, MD 21203.
The target types and locations are listed in
Table 1.

The four selected techniques (two- and three-class
relaxation, pyramid linking, and superspike)
were applied to these images. [In the case of
images 21-30, they were applied to only one quadrant,
since the methods are essentially orientation-invariant;
the scores (detections and false
alarms) obtained in this way were multiplied by
4.] The pyramid linking algorithm was designed
for 64x64 images;* in order to apply it to images
2-10, 31-36, and 55-70, they were resampled down
to that size, and the outputs (centroids and
rectangles) were scaled up in order to compare
them with the ground truth.

Figure  3  shows  the  segmentation  results  using 
the  four  methods  for  each  of  the  51  images.  Table 
2  summarizes,  by  image  class,  the  number  of  targets 
present,  the  number  correctly  detected,  the  number 
of  extra  detections,  the  number  of  false  alarms, 
and  the  segmentation  accuracy.  Detailed  results 
for  the  51  individual  images  are  given  in  [5] . 

We  see  from  these  results  that  segmentation 
accuracy  does  not  vary  greatly  among  the  methods; 
it  ranges  between  about  .5  and  .8  in  all  cases. 
Extra  detections  are  also  not  a  significant  factor, 
except  perhaps  for  the  pyramid  linking  and  super¬ 
spike  methods  as  applied  to  the  NVL  data  (images 
11-30).  As  regards  correct  detections  and  false 
alarms,  3-class  relaxation  and  superspike  were  the 
best  methods  (though  no  method  was  very  good)  for 
the  Navy  images;  pyramid  linking  and  superspike  had 
good  detection  rates  for  the  NVL  data,  but  the 
former  had  a  much  higher  false  alarm  rate;  and 
superspike  was  by  far  the  best  method  for  the  Air 
Force  and  NVL  flight  test  images,  making  it  the 
best method overall. It detected 111 of the 126
targets (over 88%) with only 26 extra detections
and 202 false alarms (about 1.6 per true target),
and its segmentation accuracy was a reasonable 0.66.
The next best method, pyramid linking (which, it
should be recalled, was applied to half-resolution
versions of images nos. 2-10, 31-36, and 55-70),
detected only 63% of the targets and had many more
false alarms (over 5 per target). For further
details  see  [5] . 
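
The scoring procedure of Section 3 and the superspike summary figures quoted above can be checked with a short sketch. It reads "at most half the true target's rectangle dimensions" literally as half the full width and height, takes the aspect ratio to be height/width, and uses the size bounds as printed in Section 3; these readings, and the names `Rect`, `plausible_target`, `detects`, and `score`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    cx: float      # centroid x
    cy: float      # centroid y
    width: float   # full width of the upright circumscribing rectangle
    height: float  # full height of the upright circumscribing rectangle


def plausible_target(r):
    """Size and elongation screening, with bounds as printed in Section 3."""
    aspect = r.height / r.width
    return 4 < r.height < 41 and 3 <= r.width <= 50 and 0.4 <= aspect <= 2.0


def detects(region, target):
    """Centroid displacement within half the target's rectangle dimensions."""
    return (abs(region.cx - target.cx) <= target.width / 2
            and abs(region.cy - target.cy) <= target.height / 2)


def score(regions, targets):
    """Return (targets detected, extra detections, false alarms)."""
    kept = [r for r in regions if plausible_target(r)]
    hits = [sum(detects(r, t) for r in kept) for t in targets]
    detected = sum(1 for h in hits if h > 0)
    extra = sum(h - 1 for h in hits if h > 1)
    false_alarms = sum(1 for r in kept
                       if not any(detects(r, t) for t in targets))
    return detected, extra, false_alarms


# Superspike totals quoted in the text.
n_targets, n_detected, n_false = 126, 111, 202
print(f"detection rate: {n_detected / n_targets:.1%}")
print(f"false alarms per true target: {n_false / n_targets:.2f}")
```

With the quoted totals, 111/126 is about 0.881 and 202/126 is about 1.60, consistent with "over 88%" and "about 1.6 per true target".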

5.  Concluding  remarks 

The results of the main study show that one
method, "superspike", performed substantially
better on the Westinghouse data base than the other
methods tested. It detected nearly 90% of the true
targets and gave only 1.6 false alarms per target.
Note that these results were obtained using segmentation
alone, in conjunction with very crude
size and height:width criteria. If the segmentation
step were followed by a classification algorithm,
much better performance could be expected.

Some further improvement in performance
can undoubtedly be obtained by further
refining the segmentation process. However, there
are limits to what can be achieved in this way by
algorithms that incorporate so little knowledge
about the nature of the targets. In order to attain
a significantly higher level of performance, it
will probably be necessary to develop a knowledge-driven
system capable of some degree of reasoning
about the regions extracted by the initial segmentation.


* Extension of this algorithm to 128x128 images
is straightforward, but would involve excessive
memory requirements.

Figure 2. Images used in main study

Figure 3. Results of main study

Table 1. Ground truth for the 51 images. (Cx, Cy) = centroid
coordinates; (Rx, Ry) = half-dimensions of circumrectangle. In
images 2-30, high gray levels are hot; in images 31-36
and 55-70, low gray levels are hot.

Table 2. Summary of results by image class.


brief overview of the major software contributions
and present selected results of the evaluation
process. We also discuss the user programs and
user interfaces supported in the Testbed
environment and present tentative plans for the
future evolution of the Testbed.


I  OVERVIEW  OF  THE  IMAGE  UNDERSTANDING  TESTBED 

A.  Background 

The Image Understanding Testbed was
established at SRI for the purpose of demonstrating
and evaluating research in machine vision sponsored
by the Defense Advanced Research Projects Agency
(DARPA). Applications of Image Understanding (IU)
techniques to automated cartography and military
image interpretation tasks have been important
components of the DARPA IU research program; these
areas form the principal focus of the Testbed
project. A number of computer programs developed
by participants in the Image Understanding program
have been transported to the uniform Testbed
environment for examination. These include systems
written in UNIX C, MAINSAIL, and FRANZLISP.
Capabilities of the computer programs include
segmentation, linear feature delineation, shape
detection, stereo reconstruction, and rule-based
recognition of classes of three-dimensional
objects.

B.  System  Hardware 

The principal element of the current Testbed
hardware configuration is a DEC VAX-11/780 central
processing unit. The Testbed VAX is on the ARPANET
network with ARPANET address SRI-IU. The Testbed
graphics system centers around a Grinnell GMR275
display system. A DeAnza IP5532 display system is
also available at SRI for research and development
work. A DEC 2060 central processing unit (SRI-AI
on the ARPANET) is accessible from the Testbed VAX
via the ARPANET.


The VAX is a four-megabyte system with one
tape drive, one RP06 disk drive, three CDC 9766
disk drives, and 32 teletype lines. The VAX
interfaces directly to a variety of terminals, a
digitizing table, a menu tablet, the Grinnell
display system, and a DEC PDP-11/34 minicomputer
that controls the DeAnza display system. Other
peripherals include a Versatec 11-inch
printer/plotter with 200-point/inch resolution, and
an Optronics C-4100 color image scanner with
resolution selectable from 12.5 to 400 microns.
Several different computer network interfaces to
the Testbed VAX are in place at this time: an
ARPANET link to the SRI IMP, a temporary CHAOSNET
interface to an interim Lisp Machine system, and a
borrowed 3-megabit/second ETHERNET network to link
to other SRI VAX systems. The CHAOSNET and
3-megabit ETHERNET links will be replaced eventually
by a 10-megabit ETHERNET interface. This
higher-bandwidth link will be used to communicate with
second-generation Lisp Machines and other local
computer systems.

The Grinnell display system has a resolution
of 512 x 512 with 32 bits per pixel. Special
features include individual zoom, pan, and color
lookup tables on each group of 8 bit planes. The
bit planes can be arranged to form a
1024 x 1024 x 8 image of which any 512 x 512
portion can be displayed. The Grinnell system is
primarily  dedicated  to  the  support  of  Testbed 
functions. 
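
The Grinnell memory arrangement described above is consistent arithmetically: 512 x 512 pixels at 32 bits hold exactly as many bits as a 1024 x 1024 image at 8 bits. The sketch below illustrates this, modelling the four 8-bit plane groups as quadrants of the larger image; how the hardware actually maps plane groups to image positions is not stated in the text, so the quadrant layout and the `viewport` helper are assumptions.

```python
import numpy as np

# Four groups of 8 bit planes, each 512 x 512 (filled with a group index
# so the quadrants can be told apart).
groups = [np.full((512, 512), k, dtype=np.uint8) for k in range(4)]

# Assumed layout: the four groups tile a 1024 x 1024 x 8-bit image.
mosaic = np.block([[groups[0], groups[1]],
                   [groups[2], groups[3]]])


def viewport(image, row, col, size=512):
    """Return the size x size window whose top-left corner is (row, col)."""
    return image[row:row + size, col:col + size]


# Same number of bits either way.
assert 512 * 512 * 32 == 1024 * 1024 * 8
print(mosaic.shape, viewport(mosaic, 256, 256).shape)
```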

The DeAnza refreshed-raster-scan display
system also has a resolution of 512 x 512 with 32
bits per pixel. This system is used mainly for
research and development. Eight bits each are
allocated to red, green, and blue data, and in
addition there are eight overlay planes. SRI has
installed a special video crossbar system to allow
the DeAnza bit planes to be allocated dynamically
among our two color monitors and up to eight remote
monochrome monitors. All DeAnza graphics
operations are carried out by a PDP-11/34 under the
direction of the VAX.


C.  System  Software 

The Testbed system runs under the VAX/VMS
operating system, using the SRI-developed EUNICE
software package to emulate UNIX operating-system
services and to support software developed under
UNIX. (UNIX is a trademark of Bell Laboratories.)
This combination of operating-system support
permits compatibility with both UNIX environments
and other VMS/EUNICE environments. In principle,
all Testbed applications software can be run on
either UNIX or VMS/EUNICE systems, provided that
appropriate system-specific hardware device drivers
are available.

The DEC 2060 presently associated with the
Testbed VAX runs under the TOPS-20 operating
system; this facility is available to run
application programs developed on PDP-10 computers
that cannot be transported easily to run on the
Testbed system itself.

D.  Language  Support 

The high-level programming languages currently
used on the Testbed are UNIX C, MAINSAIL, and
FRANZLISP. Both DEC and Berkeley UNIX versions of
the FORTRAN and PASCAL language compilers are
available but are not used in any contributed
software. The DEC C language compiler can in
principle be used instead of the UNIX C compiler
for some applications under the VMS operating
system. In addition to FRANZLISP, the ISI VAX
INTERLISP dialect and an experimental version of
the MIT NIL LISP dialect are also available. Lisp
Machine LISP will also be available when Lisp
Machines are integrated into the Testbed system.

The Testbed graphics capabilities have been
implemented for the DeAnza in MAINSAIL. The
Grinnell graphics capabilities are fully supported
in C, using software based originally on the CMU
Grinnell graphics package. FRANZLISP and MAINSAIL
programs may access the Grinnell by means of the C
Grinnell graphics package.


II  CHARACTERISTICS  OF  CONTRIBUTED  SOFTWARE 

A.  Overview  of  Contributions 

In addition to SRI International, the
institutions contributing software systems to the
Image Understanding Testbed are Carnegie-Mellon
University, the Massachusetts Institute of
Technology, Stanford University, the University of
Maryland, the University of Rochester, and the
University of Southern California.

The IU Testbed software environment is now
reaching maturity. Software modules integrated
into the system include libraries of user
utilities, graphics routines, and image access
routines. Each of the designated Testbed
contributor sites has defined and delivered, or
arranged for delivery of, contributions to the
Testbed system. Among the research contributions
are two modules from SRI and two from CMU; also
running on the Testbed are one contribution each
from Rochester, Maryland, and USC, as well as a
major system in FRANZLISP from Stanford. The MIT
contributions in Lisp Machine LISP must await the
delivery of Lisp Machines to the Testbed. CMU has
also furnished utilities, graphics, and picture
access packages, while SRI has implemented an
extended picture format and many user utilities.

A summary of the currently operational
research software contributions is given in Table
1.


E.  Personnel 

The IU Testbed program has been carried out in
the Artificial Intelligence Center of SRI
International's Computer Science and Technology
Division under the general supervision of Dr. Nils
J. Nilsson. Dr. Martin A. Fischler has been the
project leader. Dr. Andrew J. Hanson has acted as
the IU Testbed project coordinator. Staff members
principally concerned with Testbed activities have
included D. L. Kashtan and Dr. F. I. Laws. Other
members of the SRI research staff who have made
essential contributions to the Testbed include
Dr. S. T. Barnard, Dr. R. C. Bolles,
Dr. A. P. Pentland, Dr. G. B. Smith, and
H. C. Wolf. Former members of the SRI vision group
who contributed to the IU Testbed effort include
Dr. H. C. Barrow, Dr. L. Quam, Dr. J. M. Tenenbaum,
and Dr. A. P. Witkin.

Table 1

SUMMARY OF CONTRIBUTIONS

INST.       CONTRIBUTION                             LANGUAGE

SRI         Road expert                              MAINSAIL
            RANSAC                                   MAINSAIL

CMU         Picture access and display packages      C
            PHOENIX segmentation system              C
            Stereo/correlation system                C

STANFORD    ACRONYM 3-D model-based vision system    FRANZLISP

MARYLAND    Relaxation package                       C

ROCHESTER   Generalized Hough transform system       C

MIT         Stereo reconstruction system             LISP MACHINE LISP

USC         Linear feature analysis                  C
            Laws texture analysis                    SAIL (C planned)

The following subsections summarize the status
of each of the currently integrated contributions.

1. Carnegie-Mellon University Contributions
to the Testbed

(1) CMU Grinnell Graphics and Image Packages


*  Date  received:  14  August  1981. 

*  Responsible  party:  David  McKeown. 

*  Language:  C  (Berkeley  Version  7  UNIX) 
running  on  the  VAX. 

* Description: These packages provide
basic access to the functions of the
Grinnell display system, as well as
the capability of accessing image data
files.

* Remarks: A number of minor
modifications were needed to make the
CMU package work with our specific
Grinnell configuration. The present
code will support any CMU
configuration or the SRI Testbed
configuration. The CMU image access
package has also been integrated into
the Testbed environment; a new,
extended Testbed picture format has
been implemented.


(2)  PHOENIX  Segmentation  Package 

*  Date  received:  December  1981. 

*  Responsible  party:  Steve  Shafer. 

*  Language:  C  (Berkeley  Version  7  UNIX) 
running  on  the  VAX. 

* Description: This is a segmentation
package that uses the Ohlander
histogram-partitioning method to
segment color imagery. Each pixel in
the input image is assigned a segment
identification label according to the
image characteristics and the
parameters selected. Segmentation is
carried out hierarchically, with
higher-level regions isolated and
segmented separately into sub-regions.
Segmentation ceases in a given region
when the program criteria for
significance of the next level of
segmentation have not been met.

* Remarks: This system has a
sophisticated user interface and a
very useful checkpoint system.

(3) Stereo Reconstruction and Correlation Package

* Date received: 28 September 1981.

* Responsible party: Charles Thorpe.

* Language: C (Berkeley Version 7 UNIX)
running on the VAX.

* Description: This is a C version of
the Moravec correlation and stereo
reconstruction package written
originally in SAIL at Stanford. The
package consists of two portions:
CORRELATE, which selects a set of
"interesting" points in one image,
using the Moravec interest operator,
and attempts to locate the
corresponding points in a second
image, using an efficient hierarchical
correlation matcher; and STEREO, which
uses the same method as CORRELATE to
find corresponding points in a series
of up to 9 images, and then employs
the Moravec method to assign a stereo
depth value and confidence level to
each member of the set of interest
points.

* Remarks: This package implements all
the basic capabilities of the original
Moravec SAIL system, plus a number of
enhancements introduced by Charles
Thorpe. 
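
The interest operator used by CORRELATE can be sketched as follows. This is a minimal illustration of the commonly described form of the Moravec operator (the minimum, over four directions, of the summed squared differences between adjacent pixels in a window); it is not the Testbed code, and the window size and function name are assumptions. Interest points would then be taken at local maxima of the returned measure.

```python
import numpy as np


def moravec_interest(image, half_window=2):
    """Interest measure: minimum directional sum of squared differences."""
    g = np.asarray(image, dtype=float)
    h, w = g.shape
    k = half_window
    interest = np.zeros((h, w))
    # Horizontal, vertical, and the two diagonal neighbor directions.
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for r in range(k, h - k - 1):
        for c in range(k + 1, w - k - 1):
            window = g[r - k:r + k + 1, c - k:c + k + 1]
            sums = []
            for dr, dc in shifts:
                shifted = g[r - k + dr:r + k + 1 + dr,
                            c - k + dc:c + k + 1 + dc]
                sums.append(((window - shifted) ** 2).sum())
            # A straight edge gives a zero sum along its own direction, so
            # only corner-like structure keeps all four sums large.
            interest[r, c] = min(sums)
    return interest
```

Taking the minimum is what suppresses responses along straight edges, which match poorly in correlation because they are ambiguous along their length.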

2. University of Maryland Contributions to
the Testbed

Relaxation Package

*  Date  received:  Final  version  received  9 
July  1981. 




* Responsible party: Bob Kirby (author:
Russell Smith, revised by Joe Pallas).

* Language: C (Berkeley Version 7 UNIX)
running on the VAX.

* Description: This relaxation package takes
an initial set of probabilities that a
pixel belongs to each of a set of classes
and iteratively adjusts them according to
the class probabilities of neighboring
pixels. Two options are provided: a
Hummel-Zucker-Rosenfeld relaxation
algorithm and a Peleg relaxation algorithm.
For the two-class case, the following steps
are executed: first, a simple algorithm is
used to generate probability assignments
from the luminance values in an image; then
the relaxation program is used to produce a
new assignment of probabilities for each
pixel; finally, an inverse algorithm
generates a luminance representation
corresponding to the reassigned pixel
probabilities. The resultant grey scale
image can be displayed for the user to
monitor the progress of the relaxation
process.

* Remarks: A multiclass method of generating
probability assignments corresponding to
luminance values has been added for test
and demonstration purposes.

3. MIT Contributions to the Testbed

Marr-Poggio-Grimson Stereo System

* Delivery awaits the installation of the
Testbed Lisp Machines.

* Responsible parties: Mike Brady, Eric
Grimson and Keith Nishihara.

* Language: Lisp Machine LISP.

* Description: This system uses zero-crossing
matches at several scales to compute
disparity values between stereo pairs.
Interpolation of the three-dimensional
surfaces between matched zero-crossings is
done with Grimson's interpolation method.

* Remarks: To run efficiently, this system
should have special hardware (a convolution
box) added on to the Lisp Machine.

* Plans: This system will be integrated with
the Testbed when Lisp Machines are
incorporated into the environment. If
possible a convolution box will be
incorporated into the Lisp Machines.

4.  University  of  Rochester  Contributions  to 
the  Testbed 

Hough  Transform  Package 

*  Date  received:  19  May  1981. 

*  Responsible  parties:  Dana  Ballard  and  Bill 
Lampeter. 


*  Language:  C  (Berkeley  Version  7  UNIX) 
running  on  the  VAX. 

* Description: This program takes a geometric
shape template and attempts to find
matching shapes in the image using the
generalized Hough transform technique. The
matched shapes may differ in displacement,
rotation, and scale from the supplied shape
template. The most likely values of
location, rotation angle, and scale are
output by the program and the reoriented
templates are displayed over the image.

* Remarks: The CMU graphics package has been
used as a basis for incorporating full
interactive graphics into this system for
both template generation and picture
processing. Several improvements have been
made in the user interface and in the
efficiency of the code, and the package was
extended to handle multiple instances of an
object.
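
The voting idea behind the generalized Hough transform can be sketched as follows. This is a deliberately reduced illustration, not the Testbed program: it handles displacement only (no rotation or scale), and it does not index the template by edge direction. All function names are hypothetical.

```python
from collections import Counter


def build_offsets(template_points, reference):
    """For each template point, store the offset from it to the reference point."""
    rx, ry = reference
    return [(rx - x, ry - y) for x, y in template_points]


def vote(edge_points, offsets):
    """Every edge point votes for each reference location it could support."""
    accumulator = Counter()
    for x, y in edge_points:
        for dx, dy in offsets:
            accumulator[(x + dx, y + dy)] += 1
    return accumulator


def best_location(edge_points, template_points, reference):
    """Return ((x, y), votes) for the accumulator cell with the most votes."""
    accumulator = vote(edge_points, build_offsets(template_points, reference))
    return accumulator.most_common(1)[0]


if __name__ == "__main__":
    square = [(0, 0), (2, 0), (2, 2), (0, 2)]
    # The same square shifted by (10, 5), with one corner missing and one noise point.
    edges = [(10, 5), (12, 5), (12, 7), (3, 3)]
    print(best_location(edges, square, reference=(1, 1)))
```

Because each surviving edge point still votes for the true reference location, the peak remains detectable when part of the shape is missing.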

5.  SRI  Contributions  to  the  Testbed 

(1)  Road  Expert 

*  Date  received:  Approximately  1  Jan 
1981. 

*  Responsible  party:  Helen  Wolf  (Lynn 
Quam  available  as  consultant). 

* Language: MAINSAIL running under
EUNICE on the VAX.

* Description: This package acquires and
tracks linear features such as roads
in aerial imagery. Tracking is done
automatically in imagery with a known
ground truth data base. Once a road
is identified and tracked, a separate
subsystem is available to analyze road
surface anomalies and to place them
into categories such as vehicles, road
surface markings, and shadows.

(2) RANSAC Image-to-Data-Base Correspondence
Package

*  Date  received:  Approximately  1  Jan 
1981. 

* Responsible parties: Martin Fischler
and Bob Bolles.

*  Language:  MAINSAIL  running  under 
EUNICE  on  the  VAX. 

* Description: This package selects a
best fit to an array of control points
possibly containing gross errors.
This method offers significant
improvements over least-squares
fitting techniques if gross errors are
present. A typical application is to
compute the corresponding camera
model, given a set of landmarks in
aerial imagery.
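
The random-sample-consensus idea can be illustrated on a much simpler problem than camera modeling. The sketch below fits a straight line to points that include gross errors; the line model, the vertical-residual test, and all names and default parameters are assumptions chosen for illustration, not the Testbed package's interface.

```python
import random


def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Fit y = slope * x + intercept while ignoring gross outliers.

    Repeatedly fits a line to a random minimal sample (two points) and keeps
    the line supported by the largest consensus set of inliers.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical sample: slope undefined in this parameterization
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


if __name__ == "__main__":
    good = [(x, 2 * x + 1) for x in (0, 1, 2, 4, 5, 7, 8, 9)]
    gross_errors = [(3, 40), (6, -20)]
    model, inliers = ransac_line(good + gross_errors)
    print(model, len(inliers))
```

A least-squares fit over all ten points would be pulled toward the two gross errors, whereas the consensus set here excludes them before any final fit is made.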

6. Stanford University Contributions to the
Testbed

ACRONYM  System 

* Date received: 15 March 1982.

* Responsible parties: Tom Binford and Rod
Brooks.

*  Language:  FRANZLISP  running  on  the  VAX.  An 
extensive  macro  package  is  used  to  preserve 
most  of  the  original  MACLISP  code. 

* Description: ACRONYM takes a scene that has
been reduced to a set of two-dimensional
ribbons and searches for instances of
three-dimensional models that have been
supplied to the system as data. This is a
rule-based system that allows great
flexibility in the interpretation and
scene-prediction process. Models can also
be defined in a very general manner by
using generalized cones, constraints, and
subclass definitions.

* Remarks: Reduction of an image to a list of
ribbons must now be done by hand, starting
with a corresponding file of line segments
generated by a program such as the
Nevatia-Babu line finder. While some test
imagery is available with the ribbon
reduction already carried out, the Testbed
ACRONYM system would profit from the
addition of an automated ribbon reduction
module.

7. University of Southern California
Contributions to the Testbed

(1) Nevatia-Babu Line Finder

* Date received: 1 June 1981 (SAIL
version); 14 June 1982 (C version from
Hughes Aircraft).

* Responsible parties: Ram Nevatia at USC
and Julius Bogdanovich at Hughes
Aircraft.

*  Language:  C  (Berkeley  Version  7  UNIX) 
running  on  the  VAX. 

* Description: This package extracts
linear features from an image and
produces a data base of lines. The
Testbed C version supports 5 x 5
convolution masks configured to
identify edges oriented at 30-degree
intervals. The edges are then
thresholded and linked together into
lines.

* Remarks: The C version of this package
lacks some of the parallel-line and
supersegment extraction features of
the SAIL version. This makes the C
version less useful for extracting
larger linear features with
distinguishable edges. It would also
be desirable to add support for using
a variety of convolution masks. We
anticipate that these features will
eventually be added to the package.

(2)  Laws  Texture  Analysis 

*  Date  received:  6  July  1981. 

* Responsible party: Ken Laws.

*  Language:  SAIL  running  under  TOPS-20 
on  the  PDP-10. 

* Description: This package segments an
image by identifying textured regions.
It is currently configured as a batch
program and uses classification
coefficients developed for the
particular set of textures in Laws'
doctoral thesis.

* Remarks: This program now runs on
SRI-AI. Substantial work will be required
to make it interact with the Testbed
VAX environment. We hope to be able
to recode this package in C and to
improve its capabilities somewhat.

B. Demonstration, Test, and Evaluation Procedures

Evaluation of software modules on the Testbed
has been conducted at several levels, depending on
the particular system in question. Each module
supports a standard demonstration of its
capabilities. The degree to which testing and
evaluation can be carried out meaningfully depends
on the flexibility of each individual program.
Some can run on completely arbitrary images, while
others require extensive supporting data that
cannot be easily assembled for arbitrary images.
Furthermore, some contributions have been
extensively documented in existing literature,
while others have required additional modifications
and documentation regarding their operation in the
Testbed environment. This diversity of
circumstances has led us to divide the
contributions into several groups according to the
type and extent of the evaluation and testing to be
performed. These categories are described in the
following paragraphs:

(1) DESCRIPTION ONLY.

Several contributions cannot be demonstrated
interactively on the IU Testbed VAX
system, but can be run only on the DEC
2060 or on special hardware such as Lisp
Machines. Since such contributions are
not part of the integrated Testbed
system, their capabilities will be
described but not evaluated. The
documentation on these modules will cover
the following subjects:

* General description of the module and
its scientific context.

* Scientific principles of operation of
the algorithm.

* References and bibliography.

(2) DEMONSTRATION AND REMARKS.

Several major stand-alone systems need
specially tailored data bases to function
correctly. When tools for construction
of such data bases are not available, the
modules will run only on a limited set of
images, thus restricting the nature of
the evaluation that can be carried out.
Such systems will not be systematically
evaluated on large numbers of images
because of the operational difficulties
of setting up the required contexts.
These systems will be available for
demonstration on limited data sets.

Evaluation issues relevant to these
contributions will be treated in
documentation covering the following
topics:

* General description of the module and
its scientific context.

* Scientific principles of operation of
the algorithm.

* Program user documentation.

* Suggestions for modifications.

* References and bibliography.

(3) DETAILED EVALUATION.

Those modules that can readily be exercised on
a wide variety of imagery in the Testbed
image library will be subject to rigorous
and detailed investigation. In addition
to the information provided for other
types of modules, we shall supply a
thorough evaluation report on the
parameters, performance, strengths, and
weaknesses of the module. The detailed
evaluation report will include the
following:

* General description of the module and
its scientific context.

* Scientific principles of operation of
the algorithm.

* Program user documentation.

* Evaluation of performance, strengths,
and weaknesses.

* Suggestions for modifications.

* References and bibliography.

We present in the following table the level of
the evaluation procedure which is currently planned
for each of the contributed Testbed software
modules.


Table 2

LEVEL OF EVALUATION FOR EACH CONTRIBUTION

CONTRIBUTION                TYPE OF EVALUATION PLANNED

SRI
  Road Expert               DEMONSTRATION AND REMARKS
  RANSAC                    DEMONSTRATION AND REMARKS

CMU
  Display package           DESCRIPTION ONLY
  PHOENIX                   DETAILED EVALUATION
  Stereo/Correlation        DETAILED EVALUATION

STANFORD
  ACRONYM                   DEMONSTRATION AND REMARKS

MARYLAND
  Relaxation                DETAILED EVALUATION

ROCHESTER
  Hough Transform           DETAILED EVALUATION

MIT
  Stereo (Lisp Machine)     DESCRIPTION ONLY

USC
  Linear Features           DESCRIPTION ONLY
  Texture Analysis          DESCRIPTION ONLY


C. Summary of Evaluation Results

Major evaluation efforts are in progress at
this time on the following modules:

* GHOUGH generalized Hough transform
shape-finding system.

* PHOENIX segmentation system.

* STEREO/CORRELATE stereo reconstruction
system.

* RELAX pixel-level relaxation system.

In the evaluation process, we have attempted
to uncover characteristics of each system which
become obvious to the user only through extensive
experimentation. While the final results of each
of these evaluation studies are not available at
this time, we present below some of the salient
features of the intermediate results. Full written
evaluation reports on each of the above modules
will be available soon.


1. GHOUGH

GHOUGH uses the generalized Hough
transform method to find instances of a predefined
template shape in an image. It allows the
determination of the location, scale, and angular
rotation of the target object. The system has also
been extended somewhat to detect multiple instances
of a shape in a single image.

The following templates were used in
testing the program: a lake, a right angle, a
circle, and an ellipse. Several interesting
artifacts of the template parametrization were
observed. An example was the quantization of
template angles resulting from the use of discrete
lattice points to compute the orientation of line
segments in the template. Very dense templates
generated excessive noise compared to sparser
outlines because neighboring pixels
were related only by angles which were multiples of
45 degrees. This significantly increased the
observed noise in the estimated object parameters.
Several variations of the implementation strategy
have been noted which would reduce such effects.
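
The angle quantization noted above is easy to reproduce. The following sketch, with hypothetical function names, lists the directions available between adjacent lattice points and contrasts them with the direction between two points that are farther apart, as they would be in a sparser outline.

```python
import math


def neighbor_angles():
    """Directions (in whole degrees) from a pixel to each of its 8 neighbors."""
    angles = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if (dx, dy) != (0, 0):
                angles.add(round(math.degrees(math.atan2(dy, dx))) % 360)
    return sorted(angles)


def segment_angle(p, q):
    """Direction in degrees of the segment from lattice point p to lattice point q."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])) % 360


if __name__ == "__main__":
    print(neighbor_angles())              # only multiples of 45 degrees
    print(segment_angle((0, 0), (3, 1)))  # a finer angle from a longer segment
```

Orientations estimated from adjacent template points can only take eight values, while segments spanning several pixels resolve intermediate angles.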

Other significant characteristics of the
algorithm were noted while attempting to locate
multiple instances of circular or elliptical
storage tanks in a variety of aerial imagery. A
profound advantage of the Hough method was found to
lie in its ability to discern incomplete and
occluded shapes. On the other hand, no single
choice of parameters would serve to locate
accurately each and every one of the circular tanks
obvious to the human observer; the blurred nature
of some of the photometry and other characteristics
of the tanks (e.g., rounded tops and shadows)
required that special choices of parameters and
templates be made in order to detect any individual
tank. Thus GHOUGH was found to be very useful for
detecting unique, photometrically distinguished
shapes or partial shapes, but needed higher-level
information to determine effective parameter
choices when less distinctive imagery was
available.

2.  PHOENIX 

PHOENIX is an Ohlander-style segmentation
package that uses histogram analysis to carry out a
hierarchical segmentation of color imagery. A
number of options are available to control the
number and type of the segmentation cuts performed
on each histogram as well as to select criteria for
determining the significance of the segmentation.
Noise point merging to smooth out the segments is
also supported.
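
A single histogram cut of the kind described above can be sketched as follows. PHOENIX's own peak selection and significance criteria are not given in this report, so the sketch assumes the simplest possible rule: take the two tallest strict local maxima and cut at the lowest bin between them. The function names are hypothetical, and flat-topped peaks are ignored for brevity.

```python
def find_peaks(hist):
    """Indices of strict local maxima in a histogram (flat tops are skipped)."""
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i + 1 < len(hist) else -1
        if count > left and count > right:
            peaks.append(i)
    return peaks


def valley_cut(hist):
    """Return the bin at which to split the histogram, or None if unimodal."""
    tallest = sorted(find_peaks(hist), key=lambda i: hist[i], reverse=True)[:2]
    if len(tallest) < 2:
        return None  # fewer than two distinct peaks: no cut is attempted
    lo, hi = sorted(tallest)
    return min(range(lo, hi + 1), key=lambda i: hist[i])


if __name__ == "__main__":
    print(valley_cut([0, 4, 9, 4, 1, 3, 8, 2]))  # bimodal histogram
    print(valley_cut([1, 3, 7, 3, 1]))           # unimodal histogram
```

Pixels on either side of the cut form candidate regions, and a hierarchical segmenter would repeat the analysis on the histogram of each resulting region.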

The user interface for PHOENIX is based
on the CMU CI command driver, which allows a wide
variety of subroutines to be called in an
interactive and user-controlled manner.
Information about each segment of a processed image
can be printed and/or displayed on the graphics
system as desired. A variety of switches and flags
are available to control graphics and other output
from the program. A particularly useful feature is
the availability of a checkpoint system that can
save the current state of a segmentation process
and read it back in at a later date for more
detailed examination or additional processing.

A number of fundamental properties of the
PHOENIX system have been noted so far in the
evaluation process. The best performance is
obtained for color imagery in which objects of
interest have distinct colors and for which the
histograms of one or more spectral bands have at
least two distinct peaks. Significant region
identification in deeper levels of the hierarchical
process also relies on the existence of more than
one distinct peak in the histograms of the parent
regions. Textured monochrome images often lack
these characteristics.

In some instances, reasonable parameter
choices fail to produce the intended results; for
example, in one test image of a city skyline, the
sky itself is segmented into dozens of barely
distinguishable subareas, while an obviously
colorful American flag is never distinguished from
the sky at any level of the hierarchical
segmentation process. On the basis of the
characteristics of PHOENIX noted in the previous
paragraph, one might conjecture that certain types
of color imagery that humans can segment well would
not be well-suited to PHOENIX. We have confirmed
experimentally that histogram-equalized color
imagery is easily segmented by humans but not by
PHOENIX; the same phenomenon is observed for
imagery with significantly different texture
regions if the regions have relatively smooth
histograms.

PHOENIX can be utilized profitably on
imagery which supports color-based region
identification if the image digitization has a rich
histogram structure. Presumably, certain types of
preprocessing could also be performed on some
images to adapt them to the PHOENIX domain; in
particular, some fairly straightforward
transformations of the color space (not supported
on the Testbed version of PHOENIX) would be
expected to have significant effects. Given
appropriate imagery, one could then use the output
of PHOENIX for higher-level tasks requiring image
segmentation information.

3.  STEREO/CORRELATE 

The STEREO/CORRELATE package is an image
correlation system based on the Moravec interest
operator. Moravec's high-performance hierarchical
moving-window correlation scheme is used to match
interest points in successive images. Conventional
stereo reconstruction algorithms are then used to
compute a possible stereo depth assignment for sets
of matched points.

The correlation algorithm works
essentially as follows. A set of points exceeding
a given threshold value for the Moravec interest
operator is found in one image. The images
themselves are all reduced by pixel averaging to
lower resolutions such as 1/2, 1/4, 1/8, and 1/16.
The correlation operator then tries to find the
best match in each of the other images to a small
window around the interest point in the
lowest-resolution version of the original image.
Searches at the next higher resolution are limited
to a window around the best match. Then the process
is iterated until a single pixel in each of the
full-resolution images is identified as the best
match.
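
The coarse-to-fine search described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the Testbed code: a sum of squared differences stands in for the correlation measure, images are lists of rows, and the function names and the `levels`, `radius`, and `half` parameters are hypothetical.

```python
def px(img, r, c):
    """Read a pixel, clamping coordinates to the image border."""
    r = min(max(r, 0), len(img) - 1)
    c = min(max(c, 0), len(img[0]) - 1)
    return img[r][c]


def halve(img):
    """Halve the resolution by averaging each 2 x 2 block of pixels."""
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(len(img[0]) // 2)]
            for r in range(len(img) // 2)]


def ssd(a, pa, b, pb, half):
    """Sum of squared differences between windows centered at pa and pb."""
    return sum((px(a, pa[0] + dr, pa[1] + dc) - px(b, pb[0] + dr, pb[1] + dc)) ** 2
               for dr in range(-half, half + 1)
               for dc in range(-half, half + 1))


def best_match(a, point, b, center, radius, half):
    """Best match in b for the window at `point` in a, searched near `center`."""
    candidates = [(r, c)
                  for r in range(center[0] - radius, center[0] + radius + 1)
                  for c in range(center[1] - radius, center[1] + radius + 1)
                  if 0 <= r < len(b) and 0 <= c < len(b[0])]
    return min(candidates, key=lambda q: ssd(a, point, b, q, half))


def coarse_to_fine(a, b, point, levels=2, radius=2, half=1):
    """Match `point` from image a in image b, refining from coarse to fine."""
    pyr_a, pyr_b = [a], [b]
    for _ in range(levels):
        pyr_a.append(halve(pyr_a[-1]))
        pyr_b.append(halve(pyr_b[-1]))
    coarse = pyr_b[levels]
    start = (point[0] >> levels, point[1] >> levels)
    # Search the entire lowest-resolution image first.
    guess = best_match(pyr_a[levels], start, coarse, start,
                       max(len(coarse), len(coarse[0])), half)
    # Then search only a small window around the scaled-up best match.
    for k in range(levels - 1, -1, -1):
        scaled = (point[0] >> k, point[1] >> k)
        guess = best_match(pyr_a[k], scaled, pyr_b[k],
                           (2 * guess[0], 2 * guess[1]), radius, half)
    return guess
```

Each refinement examines at most `(2 * radius + 1) ** 2` candidates, which is where the computational saving comes from; it is also why an early wrong match cannot be recovered at finer levels.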

This process has the obvious advantage of
reducing the amount of computation required and
improving real-time performance. In typical
experiments with the system, one occasionally finds
that the search window wanders away from the true
matching points and that the search window at the
highest resolution level may not contain the
desired point. Also, periodic structures such as
wall tiles and building windows introduce
systematic mismatching of similar interest points.
We note that the introduction of some simple
higher-level guidance could eliminate most such
problems with the STEREO/CORRELATE system.

STEREO/CORRELATE functions very well when
used to construct a sparse depth map of imagery
that contains few confusing or periodic structures;
the addition of some external guidance would extend
its utility to extremely broad classes of imagery.
Among the major advantages of the system are that
it has very good real-time performance and behaves
well even when applied to poor image data.
Ultimately one would like to use the system to
construct more detailed stereo depth maps of an
image; to accomplish this objective, one would need
to add a powerful interpolation system to convert
the sparse interest-point map into pixel-by-pixel
depth assignments.

4.  RELAX 

RELAX is a package that supports both the
Hummel-Zucker-Rosenfeld and Peleg pixel-level
relaxation algorithms. To use the relaxation
technique, one first runs a utility that converts a
photometric image into a matrix of probabilities
that given pixels belong to postulated categories.
A set of compatibility coefficients is then
computed to support the relaxation computations.
Finally, one performs any desired number of
iterations to yield a new set of probability
assignments; if desired, a revised grey-scale image
can be computed from the probability assignments
and displayed.
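
One iteration of pixel-level relaxation can be sketched as follows. The sketch uses the classical nonlinear update usually associated with Rosenfeld, Hummel, and Zucker, in which each label probability is multiplied by one plus its neighborhood support and the result is renormalized. Whether RELAX uses exactly this form is an assumption, as are the 4-neighbor support, the hand-chosen compatibility matrix, and the function name; Peleg's variant differs and is not shown.

```python
def relax_step(prob, compat):
    """One synchronous relaxation iteration.

    prob[r][c] is a list of label probabilities for pixel (r, c);
    compat[a][b] in [-1, 1] says how well label a at a pixel agrees
    with label b at a neighboring pixel.
    """
    rows, cols, k = len(prob), len(prob[0]), len(compat)
    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))
    updated = []
    for r in range(rows):
        row = []
        for c in range(cols):
            nbrs = [(r + dr, c + dc) for dr, dc in offsets
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            if not nbrs:
                row.append(list(prob[r][c]))
                continue
            # Average support that the neighbors lend to each label.
            support = [sum(compat[a][b] * prob[nr][nc][b]
                           for nr, nc in nbrs for b in range(k)) / len(nbrs)
                       for a in range(k)]
            raw = [prob[r][c][a] * (1.0 + support[a]) for a in range(k)]
            total = sum(raw)
            row.append([v / total for v in raw] if total > 0 else list(prob[r][c]))
        updated.append(row)
    return updated


if __name__ == "__main__":
    # Labels: index 0 = dark, index 1 = bright. One doubtful pixel in a bright area.
    grid = [[[0.2, 0.8] for _ in range(3)] for _ in range(3)]
    grid[1][1] = [0.6, 0.4]
    same_label_agrees = [[1.0, -1.0], [-1.0, 1.0]]
    print(relax_step(grid, same_label_agrees)[1][1])
```

In this example the doubtful center pixel is pulled toward the bright label by its four confident neighbors, which is the noise-cleaning behavior the package is meant to provide.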

The various steps in the application of
the RELAX package have been integrated into a
flexible user system based on the SRI ICP
interactive command processor. We note that
two-category relaxations allow fairly straightforward
conversion between the imagery and probability
structures; three or more categories require
methods that are increasingly arbitrary. One
solution to this problem would be an interactive
utility to aid the user in assigning category
probabilities.

This system's ability to improve noisy
imagery and to aid in the extraction of an image
signal depends strongly on the nature of the
initial data and the probability assignments. The
current system works most robustly on images with a
two-category interpretation, and hence is best used
on image segments with a single bright signal area
against a dark background (or vice versa). The
most effective way to use this system would be
first to identify a subarea containing only one
object of interest against a bland background, then
to run RELAX to improve the signal. Alternatively,
one could use an application-dependent preprocessor
to assign probabilities based on criteria more
complex than the values of single pixels. This
system produces excellent results if sufficient
information is available for a meaningful
assignment of category probabilities to the pixels
of the original image.


III  TRANSPORTABLE FEATURES OF THE
TESTBED ENVIRONMENT


One of the objectives of the Testbed program
has been to lay the foundations for a system that
would be in some sense transportable to other
similar research environments. This
transportability would allow other sites to make
use of existing Testbed code without having to
develop their own version; it would also make it
possible for other sites to carry out their own
evaluations and improvements of basic Testbed
contributions to meet their specific needs.

These objectives have been partially met.
Each contribution to the Testbed system can be
tested and demonstrated with minimal modifications
on UNIX or EUNICE/VMS VAX systems with Grinnell
display devices. Many utilities have been acquired
from contributing sites or developed locally by
Testbed personnel. A picture library and a simple
system for accessing it have been developed. A new
and general Testbed image file format has been
created that supports all of the image types we
have found useful in integrating contributed
software. A modified version of the CMU image
access package supports all essential image
retrieval and access functions.

There are also several desirable objectives
that remain to be achieved at this time. For
example, graphics and image display on the Testbed
are supported entirely by an extension of the CMU
Grinnell display package. This is a large body of
software whose existence allowed basic Testbed
demonstration and testing objectives to be met in a
timely fashion. However, the package is manifestly
device-dependent, and so each of the application
programs carries with it the device dependence
inherent in using the Grinnell display package.

Ultimately, we would like to adopt a uniform
device-independent graphics standard to support the
Testbed demonstrations on whatever devices happen
to be available at a particular site. This goal
has been hampered by the well-known deficiencies of
the SIGGRAPH standard in the area of raster
graphics as well as the unsettled status of
proposed extensions and substitutions for the
SIGGRAPH format; in particular, the ANSI standards
committee may soon decide to endorse an entirely
different standard. A device-independent system
will soon be implemented on the Testbed for use in
the SRI research program, but the exact choice of
format is still being debated.

Another objective is the establishment of a
standard set of utilities for registering multiple
images to a ground truth data base. Some progress
has been made in this direction by the SRI RANSAC
system and some supporting modules which are
currently being converted into C, and by the CMU
“Browse” system (which is not yet ready for
transport to the Testbed). Further systematization
of such image generation data as time of day,
lighting characteristics, photometric parameters
and camera characteristics would also be desirable.
The systematic application of IU techniques to
cartographic problems requires that such
information be available for all imagery intended
for use as source data.

In the following subsections, we present a
summary of the basic capabilities which are
supported in the Testbed system and are potentially
transportable to copies of the Testbed system.

A.  Utility  Programs 

Among  the  generally  useful  utility  programs 
available  on  the  Testbed  are  the  following: 

(1) CI. This is a command interpreter
contributed by CMU. It allows one to
link a variety of subroutines into a
top-level command processor and to invoke
the subroutines with arguments provided
interactively by the user. Extensive
help and utility facilities are
supported.

(2) ICP. This is a command interpreter for
the C language contributed by SRI. It is
very similar to CI, except that its
treatment of arguments and local
variables is more general. ICP, for
example, is able to invoke system or user
subroutines directly, while CI must have
an argument-parsing interface written for
each routine.

(3) DOC. This is a CMU utility for
generating program documentation (UNIX
“man” entries) without needing to know
any details of the TROFF phototypesetting
system. All information that the program
needs to generate a syntactically correct
“man” entry is obtained by interrogating
the user.

(4) NORMALIZE. This CMU routine normalizes a
grey-scale image to produce a new output
image with desired compression or
clipping. SRI modifications allow
grey-scale stretching as well.


(5) REDUCE. This CMU routine extracts a
subwindow of an image, or rescales an
image by an integer sampling factor.

(6) SHAPEUP. The original CMU routine of
this name has been entirely rewritten to
support conversions among many image
formats.

(7) IMGSYS. This is a Testbed system that
allows a variety of picture files to be
accessed, described, and displayed in a
desired format of subwindows on the
display system.

B.  User  Interface  Systems 

The  following  systems  permit  useful 
information  or  features  to  be  made  available  to 
users  of  the  Testbed: 

(1) TESTBED DEMO DIRECTORIES. A collection
of complete demonstration facilities has
been set up in the Testbed demonstration
directory. Each of the demonstrable
C-coded contributions is represented by a
series of subdirectories supporting
various informative demonstrations of the
program capabilities. Ground truth data
for comparison with the program output is
also available in some cases. The
command files supplied in the
demonstration directories provide
detailed examples of program invocation;
from these examples, a sophisticated user
can in principle deduce the fundamental
operating procedures for each program.
We note that detailed written
documentation of program usage will be
available in the evaluation reports for
each contribution.

(2) VAX EMACS INFO. An INFO macro package
has been developed at the SRI Testbed to
support an extended version of the TECO
EMACS INFO system. This system is a
chain-linked documentation reading and
generation system that utilizes the basic
window-oriented features of the EMACS
editor to access, search, and display
text information. On-line Testbed
documentation is available via the INFO
system. This provides a well-structured
and convenient access mechanism to the
on-line documentation of the system's
functions and capabilities.

(3) LEDIT and LTAGS. An intercommunicating
pair of special modifications of EMACS
and FRANZLISP has been implemented on
the SRI EUNICE/VMS system to support
Lisp-Machine-like capabilities for
developing FRANZLISP programs. LEDIT
allows the user to copy any defined
function from a FRANZLISP image into an
EMACS editor buffer, modify it, and
reload it into the FRANZLISP process
without changing any other part of the
FRANZLISP environment. Files with many
functions can be edited in EMACS and the
functions of interest marked for loading
when the user returns to FRANZLISP. The
LTAGS package works in concert with LEDIT
in EMACS, allowing the user to display
any desired function in his window for
editing by simply giving the first few
characters of the function name; the
system automatically keeps track of which
files contain which functions, so that
such complications are invisible to the
user. We note that the system service
capabilities needed to support the
intercommunications involved in LEDIT are
not now available on UNIX, and so require
the VMS operating system.

(4) ASKLIB and ARGLIB. This is a set of
utility routines available for
application programs that need to
interrogate the user about the values of
program parameters. ASKLIB has been
slightly extended from the original CMU
package, while ARGLIB has been entirely
rewritten for the Testbed.

(5) ERR.H error package. This is a Testbed
package that supports flexible and
user-friendly reporting and handling of
error conditions.

C.  Picture  Data  Base  System 

The Testbed Picture Data Base System (PICDBMS)
is a FRANZLISP-based system that interacts with a
directory of test imagery to allow the entry and
retrieval of image characteristics from an image
data file. Following the CMU picture file
conventions, each image is assigned a named
directory, e.g., /iu/tb/pic/chair, which contains
the picture data, e.g., 4red.img, 4blue.img,
4green.img, along with collateral data files.
PICDBMS contains utilities for creating or editing
a "pic.dat" file in each picture directory. This
data file, containing data formatted for easy LISP
readability, includes picture descriptions, picture
characteristics, and a list of data base keys.
Typical data base keys that are currently supported
include the words listed in parentheses below:

* IMAGE TYPE AND MULTIPLICITY: (bw color
stereo multiple)

* SCENE DOMAIN TYPE: (indoor cultural
natural)

* CONTENT CHARACTERISTICS: (point linear
area)

* VIEWPOINT: (aerial ground).

Other types of data can be supported as the need
arises. Sets of images can be retrieved by asking
for images corresponding to a set of keys; both AND
and OR conditions are supported in the data base
key interrogation.
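
Keyed retrieval with AND and OR conditions can be sketched as set operations. PICDBMS itself is FRANZLISP-based and its query interface is not described here, so the Python below is only an illustration: the function names, the `mode` parameter, and the sample catalog entries (apart from the key words and the "chair" directory name quoted above) are hypothetical.

```python
def matches(keys, wanted, mode="and"):
    """True if an image's keys satisfy the query under AND or OR semantics."""
    keys, wanted = set(keys), set(wanted)
    if mode == "and":
        return wanted <= keys      # every requested key must be present
    return bool(wanted & keys)     # at least one requested key is present


def retrieve(catalog, wanted, mode="and"):
    """Names of all images in the catalog whose keys satisfy the query."""
    return sorted(name for name, keys in catalog.items()
                  if matches(keys, wanted, mode))


if __name__ == "__main__":
    # Hypothetical catalog: picture name -> data base keys.
    catalog = {
        "chair": {"color", "indoor", "area", "ground"},
        "harbor": {"bw", "cultural", "linear", "aerial"},
        "forest": {"color", "natural", "area", "aerial"},
    }
    print(retrieve(catalog, ["color", "aerial"], mode="and"))
    print(retrieve(catalog, ["color", "aerial"], mode="or"))
```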

Additional facilities of PICDBMS include a
browsing utility to display lists of images
provided by the keyed data base retrieval
subsystem. Images too large to fit on one Grinnell
screen can have any desired windows displayed in
sequence.


IV  PLANS 

The future of the Image Understanding Testbed
program at SRI will be closely tied to the SRI IU
research efforts and to the evolving
characteristics of Testbed copy systems to be
installed at ETL and potentially at other DMA
sites. The generality and transportability of the
IU programs and utilities will continue to be
enhanced in support of various research efforts.
We anticipate that the incorporation of Lisp
Machines into the environment will result in a
substantial movement toward LISP-based IU
application programs.

The major shift in emphasis in the Testbed
environment at SRI will be from low-level image
processing code towards increasing reliance on
rule-based expert systems to control the choice of
low-level processes, the parameters to be used, and
the communications interfaces between the computer
system and the human analyst. We anticipate
development of a substantial capability for
supporting expert systems that facilitate the
application of IU research results to cartographic
problems.


